One often encounters the curse of dimensionality in the application of dynamic programming to determine
optimal policies for controlled Markov chains. In this paper, we provide a method to construct sub-optimal
policies along with a bound for the deviation of such a policy from the optimum via a linear programming
approach. The state-space is partitioned and the optimal cost-to-go or value function is approximated by a
constant over each partition. By minimizing a non-negative cost function defined on the partitions, one can
construct an approximate value function which also happens to be an upper bound for the optimal value 
function of the original Markov Decision Process (MDP). As a key result, we show that this approximate 
value function is independent of the non-negative cost function (or state dependent weights as it is referred 
to in the literature) and moreover, this is the least upper bound that one can obtain once the partitions 
are specified. Furthermore, we show that the restricted system of linear inequalities also embeds a family 
of MDPs of lower dimension, one of which can be used to construct a lower bound on the optimal value 
function. The construction of the lower bound requires the solution to a combinatorial problem. We apply 
the linear programming approach to a perimeter surveillance stochastic optimal control problem and obtain 
numerical results that corroborate the efficacy of the proposed methodology.

1. Introduction

The Linear Programming (LP) approach to solving dynamic programs (DPs) originated from the
papers: Manne (1960), d'Epenoux (1963), Denardo (1970), Hordijk and Kallenberg (1979). The
basic feature of an LP approach for solving DPs corresponding to maximization of a discounted
payoff is that the optimal solution of the DP (also referred to as the optimal value function) is
the optimal solution of the LP for every non-negative cost function. The constraint set describing
the feasible solution of the LP and the number of independent variables are typically very large
(curse of dimensionality) and hence, obtaining the exact solution of a DP (stochastic or otherwise)
via an LP approach is not practical. Despite this limitation, an LP approach provides a tractable
method for approximate dynamic programming (Mendelssohn 1980, Schweitzer and Seidmann 1985,
Trick and Zin 1997) and the advantages of this approach may be summarized as follows:

1. One can restrict the value function to be of a certain parameterized form, thereby reducing 
the dimension of the LP to the size of the parameter set to make it tractable. 

2. The solution to the LP provides upper bounds for the value function (lower bounds, if minimizing
a discounted cost, as opposed to maximizing discounted payoff, is considered as the optimization
criterion).

The main questions regarding the tractability and quality of approximate DP revolve around 
restricting the value function in a suitable way. The questions are: (1) How does one restrict the 
value function, i.e., what basis functions should one choose for parameterizing the value function? 
(2) Are there any (a posteriori) bounds that one can provide about the value function from the 
solution of a restricted LP? If the restrictions imposed on the value function are consistent with the 
physics/structure of the problem, one can expect reasonably tight bounds. There is another question
that naturally arises: In the unrestricted case, the optimal solution of the LP is independent of
the choice of the non-negative cost function. While it is unreasonable to expect that the optimal
value function be a feasible solution of the restricted LP, one can ask if the optimal solution of
the restricted LP is the same for every choice of non-negative cost function for the LP. It has been
reported in the literature that this is unfortunately not the case (De Farias and Van Roy 2003).

If the LP is not properly restricted, it can lead to poor approximation and perhaps, even
infeasibility (Gordon 1999). A common approach is to approximate the value (cost-to-go) function
by a linear functional of a priori chosen basis functions (Schweitzer and Seidmann 1985). This
approach is attractive in that for a certain class of basis functions, feasibility of the approximate (or
restricted) LP is guaranteed (De Farias and Van Roy 2003). A straightforward method for selecting
the basis functions is through a state aggregation method. Here the state space is partitioned
into disjoint sets or partitions and the approximate value function is restricted to be the same for
all the states in a partition. The number of variables for the LP therefore reduces to the number of
partitions. State aggregation based approximation techniques were originally proposed by
Axsater (1983), Bean et al. (1987), Mendelssohn (1982). Since then, substantial work has been reported in
the literature on this topic (see Van Roy (2006) and the references therein). In this article, we adopt
the state aggregation method.

Although imposing restrictions on the value function reduces the size of the restricted LP, the 
number of constraints does not change. Since the number of constraints is at least of the same 
order as the number of states of the DP, one is faced with a restricted LP with a large number 
of constraints. An LP with a large number of constraints may be solved if there is an automatic
way to separate a non-optimal solution from an optimal one (Grotschel et al. 1981); otherwise,
one may have to resort to heuristics or settle for an approximate solution. Separation of a
non-optimal solution from an optimal one is easier if one has a compact representation of constraints
(Morrison and Kumar 1999) or if a subset of the constraints that dominate other constraints can
easily be identified from the structure of the problem (Krishnamoorthy et al. 2011b). Heuristic
methods include aggregation of constraints, sub-sampling of constraints (De Farias and Van Roy)
and constraint generation methods (Grotschel and Holland 1991, Schuurmans and Patrascu,
Trick and Zin 1998).

If the solution of the restricted LP is the same for every non-negative cost function of the LP, 
then it suggests that the constraint set for the restricted LP embeds the constraint set for the 
exact LP corresponding to a reduced order Markov Decision Process (MDP). If one adopts a naive
approach and "aggregates" every state into a separate partition, we obtain the original exact LP
and clearly, for this LP, the solution is independent of the non-negative cost function. It would
seem reasonable to expect that this would generalize to partitions of arbitrary size and in fact, we
prove this to be the case in this article. One can construct a sub-optimal policy from the solution
to the restricted LP by considering the policy that is greedy with respect to the approximate value
function (Porteus 1975). By construction, the expected discounted payoff for the sub-optimal policy
will be a lower bound to the optimal value function and hence, can be used to quantify the quality 
of the sub-optimal policy. Also the lower bound will be closer to the optimal value function than 
the approximate value function by virtue of the monotonicity property of the Bellman operator. 
But the lower bound computation is not efficient since the procedure involved is tantamount to 
policy evaluation which involves the solution to a system of linear equations of the same size as the 
state-space. In this work, we have developed a novel disjunctive LP, whose solution can be used 
to construct a lower bound to the optimal value function. The contributions of our work may be 
summarized as follows: 

• If one were to adopt a state aggregation approach, then the solution to the restricted LP 
is shown to be independent of the non-negative cost function. Moreover, the optimal solution is 
dominated by every feasible solution to the restricted LP. 



• We also show that considering alternate LP formulations via lifting of variables or by considering
a bigger feasible set via iterated Bellman inequalities (Wang and Boyd 2010) does not improve
upon the upper bound provided by the restricted LP.

• A subset of the constraints of the restricted LP can be used for constructing a lower bound 
for the optimal value function. However, this involves solving a disjunctive LP, which may not be 
computationally tractable. 

• We demonstrate the use of aggregation based restricted LPs for a perimeter surveillance 
stochastic control problem. For the application considered here, we show that both the lower 
bounding disjunctive LP and the upper bounding restricted LP can be solved efficiently since they 
both reduce to exact LPs corresponding to some lower dimensional MDPs.

The rest of the paper is organized as follows: we provide a general overview of stochastic dynamic
programs in section 2 followed by LP preliminaries in section 2.1. In section 3 we introduce the
aggregation method and discuss the restricted LP approach that can be used to approximate the
optimal value function. In the same section, we also present a novel disjunctive LP that can be
used to compute a lower bound to the optimal value function. We introduce the perimeter alert
patrol problem in section 4 and also elaborate on the efficient LP formulations that arise out of
the structure in the problem. We corroborate the structure in the perimeter patrol problem via
numerical results in section 5. Finally, we support the proposed approximation methodology via
simulation results in section 5.1 followed by a summary in section 6. Supplementary material and
lengthy proofs that have been left out of the main body of the paper for clarity have been included
in the Appendix.

2. Stochastic Dynamic Programming 

Consider a discrete-time Markov decision process (MDP) with a finite state space $S = \{1, 2, \ldots, |S|\}$.
For each state $x \in S$, there is a finite set of available actions $U_x$. From current state $x$, taking
action $u \in U_x$ under the random influence $Y$ results in a reward $R_u(x)$. The system follows some
discrete-time dynamics given by:

$$x(t+1) = f(x(t), u(t), Y(t)), \tag{1}$$

where $t$ indicates time. We assume that the random input $Y$ can only take a finite set of values $Y_l$, $l = 0, \ldots, m$,
and there is a probability $p_l$ associated with each choice. State transition probabilities
$P_u(x,y)$ represent, for each pair $(x,y)$ of states and each action $u \in U_x$, the probability that the
next state will be $y$ given that the current state is $x$ and the current action taken is $u$, i.e.,

$$P_u(x,y) = \begin{cases} 0, & \text{if } y \neq f(x,u,Y_l) \text{ for any } l \in \{0, \ldots, m\}, \\ \sum_{l \in C} p_l, & \text{where } C = \{l \mid y = f(x,u,Y_l)\}. \end{cases} \tag{2}$$

Any stationary policy, $\pi$, specifies for each state $x \in S$ a control action $u = \pi(x)$. We abuse notation
and also write the transition probability matrix associated with policy $\pi$ to be $P_\pi$, where
$P_\pi(x,y) = P_{\pi(x)}(x,y)$. Similarly, we express the column vector of immediate payoffs associated with the policy
$\pi$ to be $R_\pi$, where $R_\pi(x) = R_{\pi(x)}(x)$. We are interested in solving a stochastic control problem,
which amounts to selecting a policy that maximizes the infinite-horizon discounted reward of the
form,

$$V_\pi(x_0) = E\left[\sum_{t=0}^{\infty} \lambda^t R_\pi(x(t)) \,\Big|\, x(0) = x_0\right],$$

where $\lambda \in [0, 1)$ is a temporal discount factor. We obtain the optimal policy by solving Bellman's
equation,

$$V^*(x) = \max_{u \in U_x}\left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l V^*(f(x,u,Y_l)) \right\}, \quad \forall x \in S, \tag{3}$$

where $V^*(x)$ is the optimal value function (or optimal discounted payoff) starting from state $x$.
The optimal policy then is given by,

$$\pi^*(x) = \arg\max_{u \in U_x}\left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l V^*(f(x,u,Y_l)) \right\}, \quad \forall x \in S. \tag{4}$$

The Bellman equation (3) can be solved using standard DP methods such as value iteration (Bellman
1957) or policy iteration (Howard 1960); however, it is computationally not tractable if the size
of the state space considered is unmanageably large. For this reason, one is interested in tractable
approximate methods that yield suboptimal solutions with some guarantees on the deviation of
the associated approximate value function from the optimal one.
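The definitions above can be made concrete on a small instance. The following is a minimal sketch, not taken from the paper: the four-state ring, the reward `R`, the disturbance probabilities and every identifier are hypothetical choices made only for illustration. It assembles $P_u$ from $f$ and $p_l$ as in (2), runs value iteration on (3), and reads off a greedy policy as in (4).

```python
# Toy instance: every name below (STATES, ACTIONS, f, R, ...) is hypothetical.
LAM = 0.9                      # discount factor lambda in [0, 1)
STATES = range(4)              # S = {0, 1, 2, 3}
ACTIONS = (0, 1)               # same action set U_x for every state
Y = (0, 1)                     # values Y_0, Y_1 of the random input
P_Y = (0.7, 0.3)               # probabilities p_0, p_1


def f(x, u, y):
    """Next state f(x, u, Y_l) on a ring of four states."""
    return (x + u + y) % 4


def R(x, u):
    """Immediate reward R_u(x) (an arbitrary illustrative choice)."""
    return float(x) - 0.5 * u


def transition_matrix(u):
    """P_u(x, y) = sum of p_l over all l with f(x, u, Y_l) == y, as in (2)."""
    P = [[0.0] * len(STATES) for _ in STATES]
    for x in STATES:
        for y_l, p_l in zip(Y, P_Y):
            P[x][f(x, u, y_l)] += p_l
    return P


def q_value(x, u, V):
    """R_u(x) + lambda * sum_l p_l V(f(x, u, Y_l))."""
    return R(x, u) + LAM * sum(p * V[f(x, u, y)] for y, p in zip(Y, P_Y))


def bellman(V):
    """One application of the right-hand side of (3)."""
    return [max(q_value(x, u, V) for u in ACTIONS) for x in STATES]


def value_iteration(tol=1e-10):
    """Iterate the Bellman operator until successive iterates agree."""
    V = [0.0] * len(STATES)
    while True:
        V_new = bellman(V)
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new


V_star = value_iteration()
# Greedy policy (4) with respect to V_star.
pi_star = [max(ACTIONS, key=lambda u, x=x: q_value(x, u, V_star)) for x in STATES]
```

Each sweep of value iteration touches every state, which is the cost that becomes prohibitive when $|S|$ is large.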

2.1. Linear Programming Approach 

In this subsection, we briefly touch upon two lemmas that we will use in the subsequent sections. 
Bellman's equation suggests that the optimal value function satisfies the following set of linear 
inequalities, which we will refer to as the Bellman inequalities: 

$$V(x) \ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V(f(x,u,Y_l)), \quad \forall u \in U_x,\ \forall x \in S,$$

that is,

$$V \ge R_u + \lambda P_u V, \quad \forall u. \tag{5}$$

Consider any integer $L \ge 1$ and for $j = 1, 2, \ldots, L$, let $V_j$ be a vector satisfying a generalization of
the Bellman inequalities, referred to as the iterated Bellman inequalities (Wang and Boyd 2010):

$$V_{j+1}(x) \ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V_j(f(x,u,Y_l)), \quad \forall x, u,\ \forall j = 1, 2, \ldots, L-1, \tag{6}$$

$$V_1(x) \ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V_L(f(x,u,Y_l)), \quad \forall x, u. \tag{7}$$

Clearly, when $L = 1$, the above system of inequalities collapses to the Bellman inequalities. The
iterated Bellman inequalities may be compactly represented as:

$$\begin{aligned} V_{j+1} &\ge R_u + \lambda P_u V_j, \quad \forall u,\ j = 1, 2, \ldots, L-1, \\ V_1 &\ge R_u + \lambda P_u V_L, \quad \forall u. \end{aligned} \tag{8}$$

We note that the above set of inequalities has cyclic symmetry, i.e., one gets the same set of
inequalities by replacing the vectors $V_1, V_2, \ldots, V_L$ by $V_2, V_3, \ldots, V_L, V_1$ respectively. Let $\pi$ be any
stationary policy. Then we have,

$$V_{j+1} \ge R_\pi + \lambda P_\pi V_j, \quad j = 1, 2, \ldots, L-1, \tag{9}$$

$$V_1 \ge R_\pi + \lambda P_\pi V_L. \tag{10}$$

By recursively applying (9) to $V_L, V_{L-1}, \ldots$ etc., in (10), we get,

$$[I - \lambda^L P_\pi^L] V_1 \ge [I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1}] R_\pi, \quad \forall \pi.$$

By cyclic symmetry, every $V_j$, $j = 2, 3, \ldots, L$, also satisfies the above inequality.
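The recursion can be written out step by step. The chain below is a sketch of that substitution; it relies on the fact that $\lambda P_\pi$ has non-negative entries, so multiplying an inequality by it preserves the direction of the inequality.

```latex
\begin{align*}
V_1 &\ge R_\pi + \lambda P_\pi V_L
      && \text{by (10)} \\
    &\ge R_\pi + \lambda P_\pi \left( R_\pi + \lambda P_\pi V_{L-1} \right)
      && \text{by (9) with } j = L-1 \\
    &\;\;\vdots \\
    &\ge \left( I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1} \right) R_\pi
         + \lambda^{L} P_\pi^{L} V_1
      && \text{by (9) with } j = 1 .
\end{align*}
```

Moving the term $\lambda^{L} P_\pi^{L} V_1$ to the left-hand side gives the stated inequality.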
Lemma 1. Let the vector $V$ satisfy the following set of inequalities:

$$[I - \lambda^L P_\pi^L] V \ge [I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1}] R_\pi, \quad \forall \pi. \tag{11}$$

Then, we have $V \ge V^*$.

Remark 1. We readily see that every feasible solution of the system of inequalities (5) or (8)
is lower bounded by the optimal value function $V^*$. By cyclic symmetry, we conclude that every
feasible $V_j$, $j = 1, \ldots, L$, is also lower bounded by $V^*$.
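One standard way to see why (11) forces $V \ge V^*$ is sketched below; it is offered as a plausible argument under the stated setting ($\lambda \in [0,1)$, $P_\pi$ a stochastic matrix), not necessarily the authors' own proof. Here $V_\pi$ denotes the discounted payoff of the stationary policy $\pi$.

```latex
\begin{align*}
\left( I - \lambda^{L} P_\pi^{L} \right)^{-1}
   &= \sum_{k=0}^{\infty} \lambda^{kL} P_\pi^{kL} \;\ge\; 0
   && \text{(entrywise; Neumann series)} \\
V  &\ge \left( \sum_{k=0}^{\infty} \lambda^{kL} P_\pi^{kL} \right)
        \left( \sum_{n=0}^{L-1} \lambda^{n} P_\pi^{n} \right) R_\pi
   && \text{(multiply (11) by the inverse)} \\
   &= \sum_{n=0}^{\infty} \lambda^{n} P_\pi^{n} R_\pi
    = \left( I - \lambda P_\pi \right)^{-1} R_\pi
    = V_\pi .
\end{align*}
```

Since this holds for every stationary policy, it holds in particular for an optimal one, which gives $V \ge V_{\pi^*} = V^*$.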

The following result relates the optimal value function to the optimal solution of an LP with a
non-negative cost function and constraints of the form given by the Bellman inequalities (5) or
the iterated Bellman inequalities (8).






Author: Park et al. 

Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) 



Lemma 2. Let $c$ be a vector of state-dependent weights with $c(x) > 0$ for every $x \in S$. Then $V^*$
minimizes the linear functional $c^T V$ among all $V$'s satisfying the Bellman inequalities (5).
Correspondingly, the $L$-tuple $(V^*, \ldots, V^*)$ minimizes the linear functional $\sum_{j=1}^{L} c^T V_j$ among all $L$-tuples
$(V_1, \ldots, V_L)$ satisfying the iterated Bellman inequalities (8).

Proof of Lemma 2. The proof follows from the fact that $V \ge V^*$ and hence, $c^T (V - V^*) \ge 0$.
Since $V^*$ is feasible for the inequalities (11) for any $L \ge 1$, the result follows. Similarly, since the
$L$-tuple $(V^*, \ldots, V^*)$ is feasible for (8) and since $V_j \ge V^*$ for $j = 1, 2, \ldots, L$, it readily follows that
the $L$-tuple is optimal. □

3. Bounds using Partitioning 

Let the set of all states $S$ be partitioned into $M$ disjoint sets, $S_i$, $i = 1, \ldots, M$. We will call the
set $S_i$ the $i$th partition. Henceforth, we will use the following notation: if $f(x,u,Y_l)$ represents the
state the system transitions to starting from $x$ and subject to a control input $u$ and a stochastic
disturbance $Y_l$, then $\bar{f}(x,u,Y_l)$ represents the partition to which the final state belongs. For a given
$u$ and partition index $i$, we define the tuple $z^{i,u} = (\bar{f}(x,u,Y_0), \bar{f}(x,u,Y_1), \ldots, \bar{f}(x,u,Y_m))$ for every
$x \in S_i$. We denote by $T(i,u)$ the set of all distinct $z^{i,u}$ for a given partition index $i$ and control $u$.
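To illustrate the notation, the sketch below enumerates the set of distinct partition tuples for a toy system. The six-state ring, the two-block partition and the names `PART_OF`, `f_bar` and `T` are assumptions made for this example only.

```python
# Hypothetical toy system: six states on a ring, split into two partitions.
Y = (0, 1)                       # disturbance values Y_0, Y_1
PART_OF = [0, 0, 0, 1, 1, 1]     # PART_OF[x] = index of the partition containing x


def f(x, u, y):
    """State reached from x under control u and disturbance y."""
    return (x + u + y) % 6


def f_bar(x, u, y):
    """Partition to which the next state f(x, u, y) belongs."""
    return PART_OF[f(x, u, y)]


def T(i, u):
    """Set of distinct tuples (f_bar(x,u,Y_0), ..., f_bar(x,u,Y_m)) over x in S_i."""
    return {
        tuple(f_bar(x, u, y) for y in Y)
        for x in range(len(PART_OF))
        if PART_OF[x] == i
    }
```

In this example `T(0, 1)` contains three tuples even though the partition has three states, while `T(0, 0)` contains only two, because two states of the partition lead to the same tuple of successor partitions.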

3.1. Restricted Linear Program 

We have, from Lemma 2, that the optimal solution to the following LP,

$$\begin{aligned} \text{ELP} := {} & \min\; c^T V, \quad \text{subject to} \\ & V \ge R_u + \lambda P_u V, \quad \forall u, \end{aligned} \tag{12}$$

referred to as the "exact LP" in the literature, is the optimal value function $V^*$. Let us start
with restricting the exact LP by requiring further that $V(x) = v(i)$ for all $x \in S_i$, $i = 1, \ldots, M$.
Augmenting these constraints to the exact LP, one gets the following restricted LP:

$$\begin{aligned} \text{RLP} := {} & \min \sum_{i=1}^{M} \sum_{x \in S_i} c(x)\, v(i), \quad \text{subject to} \\ & v(i) \ge R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v(\bar{f}(x,u,Y_l)), \quad \forall x \in S_i,\ i = 1, \ldots, M,\ \forall u, \end{aligned} \tag{13}$$

where $\bar{f}(x,u,Y_l)$ is the index of the partition containing $f(x,u,Y_l)$. The restricted LP can also
be written in the following compact form:

$$\begin{aligned} \text{RLP} = {} & \min\; c^T \Phi v, \quad \text{subject to} \\ & \Phi v \ge R_u + \lambda P_u \Phi v, \quad \forall u, \end{aligned} \tag{14}$$

where the columns of $\Phi$ (commonly referred to as "basis functions" in the literature) are given by,

$$\Phi(x, i) = \begin{cases} 1, & \text{if } x \in S_i, \\ 0, & \text{otherwise}, \end{cases} \quad i = 1, \ldots, M. \tag{15}$$

The restricted LP typically deals with a much smaller number of variables, i.e., $M \ll |S|$. An
approximate value function can be constructed from every feasible solution to RLP according to
$V_{up} = \Phi v \Leftrightarrow V_{up}(x) = v(i),\ \forall x \in S_i,\ i = 1, \ldots, M$. Since the approximate value function satisfies, by
construction, the Bellman inequalities (5), it is automatically an upper bound to $V^*$ by Lemma 1.
So, if $v^*$ is the optimal solution to RLP (13), then clearly, $\Phi v^* \ge V^*$. Now we are ready to address
one of the main results of the paper.
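The upper-bounding property can be checked numerically on a toy instance. The sketch below is an illustration under stated assumptions, not the LP procedure of the paper: instead of calling an LP solver, it iterates the operator $v(i) \leftarrow \max_{x \in S_i,\, u} \{ R_u(x) + \lambda \sum_l p_l v(\bar{f}(x,u,Y_l)) \}$, on the assumption that the smallest vector satisfying the constraints of (13) is the fixed point of this monotone contraction. The MDP data and all identifiers are hypothetical.

```python
# Hypothetical toy MDP: six states on a ring, two partitions of three states.
LAM = 0.9
STATES = range(6)
ACTIONS = (0, 1)
Y, P_Y = (0, 1), (0.7, 0.3)
PART_OF = [0, 0, 0, 1, 1, 1]
M = 2


def f(x, u, y):
    return (x + u + y) % 6


def R(x, u):
    return float(x % 3) - 0.2 * u


def lookahead(x, u, value_at):
    """R_u(x) + lambda * sum_l p_l * value_at(f(x, u, Y_l))."""
    return R(x, u) + LAM * sum(p * value_at(f(x, u, y)) for y, p in zip(Y, P_Y))


def iterate(update, n, tol=1e-10):
    """Apply `update` repeatedly from the zero vector until it stops changing."""
    v = [0.0] * n
    while True:
        new = update(v)
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new


def exact_update(V):
    """Bellman operator of the original MDP, as in (3)."""
    return [max(lookahead(x, u, lambda s: V[s]) for u in ACTIONS) for x in STATES]


def restricted_update(v):
    """Tightest right-hand side of the RLP constraints (13) for each partition."""
    return [
        max(lookahead(x, u, lambda s: v[PART_OF[s]])
            for x in STATES if PART_OF[x] == i
            for u in ACTIONS)
        for i in range(M)
    ]


V_star = iterate(exact_update, len(STATES))   # optimal value function
v_up = iterate(restricted_update, M)          # piecewise-constant upper bound
```

Comparing `v_up[PART_OF[x]]` with `V_star[x]` state by state shows how much is lost by forcing the value function to be constant on each partition.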

Theorem 1. The optimal solution, v* , to the RLP is independent of the cost vector c once the 
partitions are specified. 

Proof of Theorem 1. The main idea behind the proof is the following: The constraints in the
restricted LP (13) do not, in general, correspond to those of a Markov Decision Process (MDP)
because the transition from one partition to another for a given control $u$ and random input $Y_l$ is
not specified unambiguously. This is because different states in the same partition can transition
to different partitions for the same $u$ and $Y_l$. If one were to think of a "random" selector for a state
in a partition, then the specification of $u$, $Y_l$ together with the random selector specifies exactly
which partition the system would transition to next, from the current partition. Let us specify the
probability of picking a state in a partition, corresponding to the random selector, via the optimal
dual variables for RLP. For a given partition index $i$, the RLP specifies a constraint on $v(i)$ for
each $x \in S_i$ and $u$. Let the dual variable corresponding to this constraint be $\mu_u(x) \ge 0$ and the
corresponding optimal dual variable be $\bar{\mu}_u(x)$. With this definition, we can proceed to prove the
result via the following steps:

1. We show that for every partition index $i$, there is a $u$ such that $\bar{\mu}_u(x) > 0$ for some $x \in S_i$.
This is necessary for constructing an MDP of reduced dimension in the next step; otherwise, the
corresponding value of $v(i)$ is not lower bounded.

2. We define a reduced order MDP on the partitions with immediate reward and transition
probability given by,

$$r_u(i) = \sum_{x \in S_i} h_u^i(x) R_u(x), \qquad \bar{P}_u(i,j) = \sum_{x \in S_i} h_u^i(x) \sum_{l:\, \bar{f}(x,u,Y_l) = j} p_l, \qquad h_u^i(x) := \frac{\bar{\mu}_u(x)}{\sum_{x' \in S_i} \bar{\mu}_u(x')},$$

where $u \in U_i$ if $\sum_{x \in S_i} \bar{\mu}_u(x) > 0$. We interpret the term $h_u^i(x)$ as the probability
of picking the state $x$ from the partition $S_i$.

3. We show that the so-called "surrogate LP" obtained by aggregating the constraints of RLP
via the optimal dual variables,

$$\begin{aligned} \text{SLP}(\bar{\mu}) := {} & \min \sum_{i=1}^{M} \sum_{x \in S_i} c(x)\, v(i), \quad \text{subject to} \\ & v(i) \ge r_u(i) + \lambda \sum_{j=1}^{M} \bar{P}_u(i,j)\, v(j), \quad \forall u \in U_i,\ i = 1, \ldots, M, \end{aligned} \tag{16}$$

is the exact LP corresponding to the reduced order MDP defined in step 2 above. In essence, for
a given $c$, the optimal value function of the reduced order MDP is the optimal solution of RLP.
We use the properties of surrogate duality (Glover 1968, 1975, Greenberg and Pierskalla 1970) to
demonstrate that $\text{SLP}(\bar{\mu}) = \text{RLP}$.

4. Finally, to show that the optimal solution to RLP is independent of $c$, we note that the
constraints of $\text{SLP}(\bar{\mu})$ are obtained by taking convex combinations of the constraints in RLP.
Hence, any feasible solution to RLP is also feasible for $\text{SLP}(\bar{\mu})$. Since every feasible solution of the
exact LP corresponding to an MDP dominates the optimal solution (from Lemma 1), we conclude
that the optimal solutions corresponding to two different cost functions $c_1$ and $c_2$ necessarily
dominate each other and hence, have to be the same. □

We shall now establish the surrogate LP result via the following lemma with the proof provided 
in the Appendix. 

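Lemma 3. Consider a surrogate LP for the RLP through a set of dual variables, $\mu$, given by:

$$\begin{aligned} \text{SLP}(\mu) := {} & \min\; c^T \Phi v, \quad \text{subject to} \\ & \sum_{x \in S_i} \mu_u(x)\, v(i) \ge \sum_{x \in S_i} \mu_u(x) \left[ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v(\bar{f}(x,u,Y_l)) \right], \quad \forall u,\ i = 1, \ldots, M. \end{aligned} \tag{17}$$

Then, $\exists\, \bar{\mu} \ge 0$ such that $\text{SLP}(\bar{\mu}) = \text{RLP}$, and, for every partition index $i = 1, \ldots, M$, $\exists\, u$ such that
$\sum_{x \in S_i} \bar{\mu}_u(x) > 0$. Moreover, the optimal solution $v^*$ to RLP is independent of the cost vector $c$ and
any other feasible solution $v$ to RLP dominates $v^*$.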

Theorem 1 implies that the upper bound for the optimal value function cannot be improved
by changing the cost function from a linear to a non-linear function or by restricting the feasible
set of RLP further, since the optimal solution of RLP is dominated by every feasible solution
of RLP. Also, $\Phi v^*$ is the least upper bound to the optimal value function $V^*$ since any other
feasible $v$ to RLP satisfies $\Phi v \ge \Phi v^*$. Hence, a refinement of the upper bound must necessarily
involve an enlargement of the feasible set if one wants to stick to an LP formulation, i.e., it should
include the feasible set of (13) and possibly other tighter upper bounds than the optimal solution
of RLP. Lifting of variables is one way to attempt such an enlargement; in this connection, we show in
the following section that neither a general lifted LP nor one obtained by including the iterated
Bellman inequalities in the constraint set improves the upper bound.

Remark 2. If one considers the sub-optimal dual variables, $\mu_u(x) = 1/|S_i|,\ \forall x \in S_i,\ \forall u$, then solving
the corresponding surrogate dual, $\text{SLP}(\mu)$, to obtain an approximate value function, would result
in the so-called "hard aggregation" method (see Sec. 4 of Bertsekas (2007)).

Remark 3. When $\mu$ and $\Phi$ are allowed to have arbitrary positive entries, with the entries of $\mu$
associated with each partition $i \in \{1, \ldots, M\}$ summing to 1 and $\sum_{j=1}^{M} \Phi(x,j) = 1,\ \forall x \in S$,
the method is referred to as "soft aggregation" (Singh et al. 1995). Unfortunately, in this case,
the optimal solution to the restricted LP formulation (14) has been shown to be dependent on the
cost function (De Farias and Van Roy 2003).



3.2. Lifted Restricted Linear Programs 

It may appear that we can get tighter upper bounds than those provided by the RLP by considering 
either lifted LPs whose feasible set is larger than that of RLP or LPs with a different objective 
function. We will show, in this section, that unfortunately this is not the case. In general, one can 
construct a lifted LP of the form: 

$$\begin{aligned} \text{LLP} := {} & \min\; c_1^T v + c_2^T z, \quad \text{subject to} \end{aligned}$$

$$V(x) \ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V(f(x,u,Y_l)), \quad \forall x, u, \tag{18}$$

$$V(x) = v(i), \quad \forall x \in S_i,\ i = 1, \ldots, M, \tag{19}$$

$$z \ge 0,$$

where $z$ is the additional vector of variables used in lifting so that the feasible set is not empty. Then,
it follows that if $(v, z)$ is optimal to LLP, then $v$ will be a feasible solution to RLP. Consequently,
$\Phi v \ge \Phi v^*$, where $v^*$ is the optimal solution of the RLP. In other words, one gets no better bound
via lifting if the constraints (18) and (19) are included. One could also use the iterated Bellman
inequalities (8) for constructing a lifted LP of the form:

$$\text{IB} := \min \sum_{j=1}^{L} c^T \Phi v_j, \quad \text{subject to}$$

$$v_{j+1}(i) \ge R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v_j(\bar{f}(x,u,Y_l)), \quad \forall x \in S_i,\ \forall i, u,\ j = 1, \ldots, L-1, \tag{20}$$

$$v_1(i) \ge R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v_L(\bar{f}(x,u,Y_l)), \quad \forall x \in S_i,\ \forall i, u. \tag{21}$$

Again, it turns out that the above lifted LP is incapable of providing a better bound, as can be
seen from the following result.

Theorem 2. If $v_{IB} = (v_1, \ldots, v_L)$ is a feasible solution to IB, then $v_j \ge v^*$ for $j = 1, \ldots, L$, where
$v^*$ is the optimal solution to RLP.

The proof for Theorem 2 follows along the lines of Lemma 3. We will construct a surrogate LP
for the lifted LP (20) with the optimal dual variables of RLP. We immediately recognize that the
inequalities defining the surrogate LP are, in fact, the iterated Bellman inequalities associated with
the reduced order MDP defined in step 2 of the proof of Theorem 1. So, the result follows from
Lemma 2 and Remark 1.

Proof of Theorem 2. Let $\bar{\mu}$ be the optimal dual variables to RLP (13). From Lemma 3, for
every partition index $i \in \{1, \ldots, M\}$, there exists a $u$ such that $\sum_{x \in S_i} \bar{\mu}_u(x) > 0$. For a fixed $i$ and
$u$, we multiply the inequalities (20) and (21) associated with a particular $x \in S_i$ with the normalized
multiplier $h_u^i(x) := \bar{\mu}_u(x) / \sum_{x' \in S_i} \bar{\mu}_u(x')$ and sum
over all the $x \in S_i$. Then, we get the following surrogate LP:

$$\text{SIB} := \min \sum_{j=1}^{L} c^T \Phi v_j, \quad \text{subject to}$$

$$v_{j+1}(i) \ge r_u(i) + \lambda \sum_{x \in S_i} h_u^i(x) \sum_{l=0}^{m} p_l\, v_j(\bar{f}(x,u,Y_l)), \quad \forall u \in U_i,\ \forall i,\ j = 1, \ldots, L-1, \tag{22}$$

$$v_1(i) \ge r_u(i) + \lambda \sum_{x \in S_i} h_u^i(x) \sum_{l=0}^{m} p_l\, v_L(\bar{f}(x,u,Y_l)), \quad \forall u \in U_i,\ \forall i, \tag{23}$$

where $u \in U_i$ if $\sum_{x \in S_i} \bar{\mu}_u(x) > 0$ and, as before, $r_u(i) = \sum_{x \in S_i} h_u^i(x) R_u(x)$ is the one-step reward function.

By Lemma 2, the optimal solution to SIB is of the form $v^*_{SIB} = (v^*, \ldots, v^*)$, where $v^*$ is the optimal
solution to $\text{SLP}(\bar{\mu})$ (and by Lemma 3, also the optimal solution to RLP). Since any feasible
solution to IB, $v_{IB} = (v_1, \ldots, v_L)$, is also feasible to SIB, it follows from Lemma 1 that $v_j \ge v^*$
for every $j = 1, \ldots, L$. □

So, we conclude that lifting through the use of iterated Bellman inequalities does not help in
finding a tighter upper bound than the RLP optimal solution. Also, using any other non-linear
objective function will not improve the upper bound as long as the iterated Bellman inequalities
(20) and (21) are included in the constraint set. In the next section, we focus our attention on the
construction of a lower bound for the optimal value function.

3.3. Lower Bound for the Optimal Value Function

For any candidate approximate value function $V$, one can construct a sub-optimal "greedy" policy
according to:

$$\hat{\pi}(x) = \arg\max_{u} \left\{ R_u(x) + \lambda \sum_{y} P_u(x,y) V(y) \right\}, \quad \forall x \in S.$$

Let us define the improvement in value function, $\alpha(x) := R_{\hat{\pi}}(x) + \lambda \sum_{y} P_{\hat{\pi}}(x,y) V(y) - V(x)$. Note
that there is no improvement, i.e., $\alpha = 0$, when $V = V^*$. The expected discounted payoff, $V_{\hat{\pi}}$,
corresponding to the suboptimal policy $\hat{\pi}$, satisfies the following bound (Porteus 1975):

$$V(x) + \frac{1}{1 - \lambda} \min_{y} \alpha(y) \le V_{\hat{\pi}}(x) \le V^*(x), \quad \forall x \in S.$$
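The bound can be verified numerically on a small example. The sketch below uses a hypothetical six-state MDP and an arbitrary piecewise-constant candidate `V`; none of these values come from the paper. It builds the greedy policy, evaluates it by iterating its own Bellman operator, and compares the three quantities in the bound.

```python
# Hypothetical toy MDP used only to illustrate the greedy-policy bound.
LAM = 0.9
STATES = range(6)
ACTIONS = (0, 1)
Y, P_Y = (0, 1), (0.7, 0.3)


def f(x, u, y):
    return (x + u + y) % 6


def R(x, u):
    return float(x % 3) - 0.2 * u


def q_value(x, u, V):
    """R_u(x) + lambda * sum_y P_u(x, y) V(y), written via f and p_l."""
    return R(x, u) + LAM * sum(p * V[f(x, u, y)] for y, p in zip(Y, P_Y))


def iterate(update, tol=1e-10):
    """Fixed-point iteration from the zero vector."""
    W = [0.0] * len(STATES)
    while True:
        new = update(W)
        if max(abs(a - b) for a, b in zip(new, W)) < tol:
            return new
        W = new


# Candidate approximate value function (an arbitrary illustrative choice).
V = [12.0, 12.0, 12.0, 14.0, 14.0, 14.0]

# Greedy policy with respect to V and the improvement alpha(x).
pi_hat = [max(ACTIONS, key=lambda u, x=x: q_value(x, u, V)) for x in STATES]
alpha = [q_value(x, pi_hat[x], V) - V[x] for x in STATES]

# Payoff of the greedy policy (policy evaluation) and the optimal value function.
V_pi = iterate(lambda W: [q_value(x, pi_hat[x], W) for x in STATES])
V_star = iterate(lambda W: [max(q_value(x, u, W) for u in ACTIONS) for x in STATES])

# Left-hand side of the bound: V(x) + min_y alpha(y) / (1 - lambda).
lower = [V[x] + min(alpha) / (1.0 - LAM) for x in STATES]
```

The policy-evaluation step is the expensive part: it works on a vector of length $|S|$, whereas `lower` only needs one sweep over the states.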

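In our experience, the lower bound to the optimal value function provided by the above inequality is very
conservative. Also, computation of $V_{\hat{\pi}}$ involves solving a linear system of equations of size $|S|$, which would be
expensive for a large state-space. So, we construct a novel alternate lower bound as follows. Recall
that for each $x \in S_i$, $V^*(x)$ satisfies the Bellman inequality (5):

$$\begin{aligned} V^*(x) &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V^*(f(x,u,Y_l)) \\ &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l \min_{y \in S_{\bar{f}(x,u,Y_l)}} V^*(y), \quad \forall u. \end{aligned} \tag{24}$$

Let $\bar{w}(i) := \min_{x \in S_i} V^*(x)$, $i = 1, \ldots, M$. Then, it follows from (24) that,

$$\bar{w}(i) \ge \min_{x \in S_i} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, \bar{w}(\bar{f}(x,u,Y_l)) \right\}, \quad \forall u,\ i = 1, \ldots, M. \tag{25}$$

The above set of inequalities motivates the following non-linear program:

$$\begin{aligned} \text{NLP} := {} & \min\; c^T w, \quad \text{subject to} \\ & w(i) \ge \min_{x \in S_i} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, w(\bar{f}(x,u,Y_l)) \right\}, \quad \forall u,\ i = 1, \ldots, M. \end{aligned} \tag{26}$$

Let $w^*$ be the optimal solution to NLP. By construction, we see that $\bar{w}$ is a feasible solution to
the NLP and hence,

$$c^T w^* \le c^T \bar{w} = \sum_{i=1}^{M} c(i) \min_{x \in S_i} V^*(x).$$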




So, by choosing $c(i) = 1$ and $c(j) = 0$ for all $j \neq i$, one can obtain a lower bound to the optimal
value function for all the states in the $i$th partition. Moreover, if the problem under consideration
exhibits a special structure, one can show that NLP collapses to an LP that can be efficiently
solved. The perimeter patrol problem considered herein exhibits such a structure; we demonstrate
this in the next section.
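For a small instance, the lower bound can be explored without any disjunctive programming machinery. The sketch below is an illustrative shortcut and not the procedure described in the text: it iterates the operator $w(i) \leftarrow \max_u \min_{x \in S_i} \{ R_u(x) + \lambda \sum_l p_l w(\bar{f}(x,u,Y_l)) \}$, on the assumption that the smallest vector satisfying (26) is the fixed point of this monotone contraction. The MDP data and all names are hypothetical.

```python
# Hypothetical toy MDP: six states on a ring, two partitions of three states.
LAM = 0.9
STATES = range(6)
ACTIONS = (0, 1)
Y, P_Y = (0, 1), (0.7, 0.3)
PART_OF = [0, 0, 0, 1, 1, 1]
M = 2


def f(x, u, y):
    return (x + u + y) % 6


def R(x, u):
    return float(x % 3) - 0.2 * u


def lookahead(x, u, value_at):
    """R_u(x) + lambda * sum_l p_l * value_at(f(x, u, Y_l))."""
    return R(x, u) + LAM * sum(p * value_at(f(x, u, y)) for y, p in zip(Y, P_Y))


def iterate(update, n, tol=1e-10):
    """Fixed-point iteration from the zero vector."""
    v = [0.0] * n
    while True:
        new = update(v)
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new


def lower_update(w):
    """For each partition: best control against the worst state in the partition."""
    return [
        max(
            min(lookahead(x, u, lambda s: w[PART_OF[s]])
                for x in STATES if PART_OF[x] == i)
            for u in ACTIONS
        )
        for i in range(M)
    ]


V_star = iterate(
    lambda V: [max(lookahead(x, u, lambda s: V[s]) for u in ACTIONS) for x in STATES],
    len(STATES),
)
w_low = iterate(lower_update, M)                      # candidate lower bound
w_bar = [min(V_star[x] for x in STATES if PART_OF[x] == i) for i in range(M)]
```

Here `w_bar` plays the role of $\bar{w}$ in (25), and `w_low[i]` can be compared against it partition by partition.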



Remark 4. The NLP is referred to as a disjunctive linear program (Balas 1979) and the optimal
solution to NLP is the solution that minimizes the same linear objective function over the convex
hull of the feasible solutions of NLP. Balas (1998) provides two methods to solve the problem: one
through a lifted representation for the convex hull of the feasible set of NLP and the other through
a cutting plane technique. The number of lifted variables is $O(M^2|U|)$; if $M = 10{,}000$, then
one must deal with a lifted LP with at least 100 million variables. The original (non-aggregated) LP has
about 10 million variables and hence, the lifted representation method is not practical. For this
reason, the cutting plane technique is the more viable alternative.

Remark 5. The lower bound provided by NLP is a non-trivial one because the optimal solution 
is the optimal value function of a reduced order MDP. Hence, the lower bound will be better 
than at least the value function associated with some suboptimal policy and so, is non-trivial and 
non-conservative. 

Remark 6. While $S_i$ may have a lot of states, the number of entries on the right hand side of the
non-linear constraint (26) over which the minimization must be carried out is the cardinality of
$T(i,u)$. NLP is combinatorial in nature, in the sense that one must pick one $(m+1)$-tuple for each
$i$ and $u$ over which the optimization must be carried out. However, for each $(m+1)$-tuple picked,
one obtains an MDP. So, the system of inequalities (26) describes a family of underlying MDPs.

4. Perimeter Patrol Problem 



The perimeter patrol problem arose from the Cooperative Operations in Urban Terrain
(COUNTER) project at AFRL (Gross et al. 2006). In this problem, there is a perimeter which must
be monitored by a collection of UAVs (we will consider only one UAV here). Along the perimeter,
there are $m$ alert stations equipped with Unattended Ground Sensors (UGSs) which detect
intrusions or incursions into the perimeter. For the sake of simplicity, we assume that incursions into the
perimeter can only occur at the stations. An incursion could be a nuisance (false alarm) or a real
threat. The UGSs raise an alarm or an alert whenever there is an incursion. The camera equipped
UAV responds to an alert by flying to the alert site and loitering there, while a remotely located
operator steers the gimballed camera looking for the source of the alarm. Here the operator serves
the role of a classifier or a sensor, i.e., the operator must determine, from the video information,
whether the intrusion is a nuisance or a threat. For details on the perimeter alert patrol problem
and the variants thereof, we refer the reader to the authors' prior work (Chandler et al. 2009,
Darbha et al. 2010, Krishnamoorthy et al. 2011b,a). Figure 1 shows a typical scenario, where there
are 4 alert stations with the UAV at a station (location 0) with an alert. The decision problem
we solve is the following: Given that the arrival process of the alerts is Poisson with known arrival
rate, what is the optimal time a UAV should spend at a station before resuming its patrol? We
associate an information gain with a UAV loitering and servicing an alert and we model this gain
as a monotonically increasing function of the loiter/dwell time $d$.

Figure 1 Perimeter patrol scenario with UAV loitering at alert station.

4.1. Problem Statement

The patrolled perimeter is a simple closed curve with $N\,(\geq m)$ nodes which are (spatially) uniformly separated, of which $m$ correspond to the alert stations. Let the $m$ distinct station locations be elements of the set $\Omega \subset \{0, \ldots, N-1\}$. A typical scenario shown in Figure 1 has 15 nodes, of which nodes $\{0, 3, 7, 11\}$ correspond to the UGSs. Here, station locations 3, 7 and 11 have no alerts, and station location 0 has an alert being serviced by the loitering UAV. At time instant $t$, let $\ell(t)$ be the position of the UAV on the perimeter ($\ell(t) \in \{0, \ldots, N-1\}$), $d(t)$ be the dwell time (number of loiters completed if at an alert site) and $\tau_j(t)$ be the delay in servicing an alert at location $j \in \Omega$. Let $y_j(t)$ be a binary, but random, variable indicating the arrival of an alert at location $j \in \Omega$. We will assume that the statistics associated with the random variable $y_j(t)$ are known and that the $y_j$, $j \in \Omega$, are independent. We model the arrival of alerts as follows: there is a single queue with a Poisson arrival stream of alerts at a rate of $\alpha$ alerts per unit time. After an alert is queued up, we assume it shows up arbitrarily at any one of the $m$ stations (the choice of station is a uniformly distributed random variable). For this reason, only one alert can arrive at one of the $m$ stations at any instant of time. Hence, there are $m+1$ possibilities for the value of the vector of alerts $y(t) = [y_1(t)\; y_2(t)\; \cdots\; y_m(t)]$, with the first one being that there is no alert at any station and the other $m$ corresponding to an alert at each of the $m$ stations. The control decisions are indicated by the variable $u$. If $u = 1$, then the UAV continues in the same direction as before; if $u = -1$, then the UAV reverses its direction of travel; and if $u = 0$, the UAV dwells at the current alert station. We will assume that a UAV advances by one node in unit time if $u \neq 0$. We also assume that the time to complete one loiter is the unit time. We denote the UAV's direction of travel by $\omega$, where $\omega = 1$ and $\omega = -1$ indicate the clockwise and counter-clockwise directions respectively. One may write the state update equations for the system as follows:

$$\begin{aligned}
\ell(t+1) &= [\ell(t) + \omega(t)u(t)] \bmod N, \\
\omega(t+1) &= \omega(t)u(t) + \delta(u(t)), \\
d(t+1) &= (d(t) + 1)\,\delta(u(t)), \\
\tau_j(t+1) &= (\tau_j(t) + 1)\,\{1 - \delta(\ell(t) - j)\,\delta(u(t))\}\,\max\{\sigma(\tau_j(t)), y_j(t)\}, \quad \forall j \in \Omega,
\end{aligned} \tag{27}$$

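where $\delta$ is the Kronecker delta function and $\sigma(\cdot) = 1 - \delta(\cdot)$. We denote the status of the alert at station location $j \in \Omega$ at time $t$ by $A_j(t)$, i.e.,

$$A_j(t) = \sigma(\tau_j(t)), \quad \forall j \in \Omega.$$
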
Also, we have the constraints: $u(t) = 0$ only if $\ell(t) \in \Omega$ and $d(t) < D$. If $d(t) = D$, then $u(t) \neq 0$, i.e., the UAV is forced to leave the station if it has already completed the maximum (allowed) number of dwell orbits. Combining the different components in (27), we express the evolution equations compactly as:

$$x(t+1) = f(x(t), u(t), y(t)),$$

where $x(t)$ is the system state at time $t$ with components $\ell(t)$, $\omega(t)$, $d(t)$ and $\tau_j(t)$, $\forall j \in \Omega$. Let us denote the $m+1$ possible values that $y(t)$ can take by the row vectors $Y_l$, $l = 0, \ldots, m$, where

$$Y_0 = [0\;0\;\cdots\;0], \quad Y_1 = [1\;0\;\cdots\;0], \quad \ldots \quad \text{and} \quad Y_m = [0\;\cdots\;0\;1]. \tag{29}$$
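
The update equations (27) can be sketched in code as a single transition function. The following is a minimal illustration, not the authors' implementation: the names `PatrolState`, `step`, `stations`, `n_nodes`, `max_dwell` and `max_delay` are hypothetical, and the optional saturation of delays at `max_delay` is an assumption added here to keep the state space finite.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class PatrolState(NamedTuple):
    loc: int                 # UAV position l(t) in {0, ..., N-1}
    omega: int               # direction of travel: +1 clockwise, -1 counter-clockwise
    dwell: int               # d(t), number of loiters completed at the current station
    delays: Tuple[int, ...]  # tau_j(t), one entry per station, ordered as `stations`


def delta(z: int) -> int:
    """Kronecker delta: 1 if z == 0, else 0."""
    return 1 if z == 0 else 0


def sigma(z: int) -> int:
    """Complement of the Kronecker delta, 1 - delta(z)."""
    return 1 - delta(z)


def step(x: PatrolState, u: int, y: Sequence[int], stations: Sequence[int],
         n_nodes: int, max_dwell: int, max_delay: Optional[int] = None) -> PatrolState:
    """Apply the update equations once for control u and alert vector y."""
    if u not in (-1, 0, 1):
        raise ValueError("u must be -1, 0 or 1")
    # Loitering (u = 0) is only admissible at a station and while d(t) < D.
    if u == 0 and (x.loc not in stations or x.dwell >= max_dwell):
        raise ValueError("loitering is not admissible in this state")

    loc = (x.loc + x.omega * u) % n_nodes
    omega = x.omega * u + delta(u)
    dwell = (x.dwell + 1) * delta(u)

    delays = []
    for j, tau, y_j in zip(stations, x.delays, y):
        serviced = delta(x.loc - j) * delta(u)   # 1 if the UAV loiters at station j
        tau_next = (tau + 1) * (1 - serviced) * max(sigma(tau), y_j)
        if max_delay is not None:
            tau_next = min(tau_next, max_delay)  # assumed saturation of the delay
        delays.append(tau_next)
    return PatrolState(loc, omega, dwell, tuple(delays))
```

For instance, with stations `(0, 3, 7, 11)` on a 15-node perimeter, loitering at station 0 resets the delay there while the outstanding alert at station 7 ages by one time unit.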

Given a Poisson arrival stream of alerts at the rate of $\alpha$ alerts per unit time, the probability that there is no alert in a unit time interval is $p = e^{-\alpha}$ and hence, the probability that $y(t)$ takes any one of the $m+1$ possible values in (29) is given by

$$p_l := \text{Prob}\{y(t) = Y_l\} = \begin{cases} p, & l = 0, \\ \dfrac{1-p}{m}, & l = 1, \ldots, m. \end{cases} \tag{30}$$
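
As a small illustration of the alert model, the sketch below computes the $m+1$ probabilities (no alert with probability $e^{-\alpha}$, the remainder split uniformly over the stations) and draws one alert vector. The function names and the value `alpha = 0.1` used in the usage note are hypothetical and chosen only for illustration.

```python
import math
import random


def alert_probabilities(alpha: float, m: int) -> list:
    """Return [p_0, p_1, ..., p_m]: no alert, or an alert at exactly one station."""
    p_none = math.exp(-alpha)
    return [p_none] + [(1.0 - p_none) / m] * m


def sample_alert_vector(alpha: float, m: int, rng: random.Random) -> tuple:
    """Draw y(t): a 0/1 vector of length m with at most one nonzero entry."""
    probs = alert_probabilities(alpha, m)
    l = rng.choices(range(m + 1), weights=probs)[0]
    y = [0] * m
    if l > 0:
        y[l - 1] = 1
    return tuple(y)
```

Calling `sample_alert_vector(0.1, 4, random.Random(0))` returns a length-4 tuple with at most one entry equal to 1.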

To be consistent with the notation introduced earlier (in Sec. 2), we shall use $\mathcal{S}$ to denote the set of all system states and use $x \in \{1, \ldots, |\mathcal{S}|\}$ to denote a particular state. Our objective is to find a suitable policy that simultaneously minimizes the service delay and maximizes the information gained upon loitering. The information gain, $I$, which is based on an operator error model (see Appendix EC.1), is plotted as a function of dwell time in Figure 2. We model the one-step payoff/reward function as follows:

$$R_u(x) = [I(d_x + 1) - I(d_x)]\,\delta(u) - \rho \max\{\bar{\tau}_x, \Gamma\}, \quad x = 1, \ldots, |\mathcal{S}|, \tag{31}$$

where $d_x$ is the dwell associated with state $x$ and $\bar{\tau}_x = \max_{j \in \Omega} \tau_{j,x}$ is the worst service delay (among all stations) associated with state $x$. The parameter $\Gamma\,(\gg 0)$ is a judiciously chosen maximum penalty. The positive parameter $\rho$ is a constant weighing the incremental information gained upon loitering once more at the current location against the delay in servicing alerts at other stations. From the state definition, we can compute the total number of states in the MDP to be

$$|\mathcal{S}| = 2 \times N \times (\Gamma + 1)^m + D \times m \times (\Gamma + 1)^{m-1}, \tag{32}$$

where the factor 2 comes from the UAV being bi-directional. For the loiter states, directionality is irrelevant and hence when $d \geq 1$, we reset $\omega$ to be 1. Note that, in view of the reward function definition (31), we do not keep track of delays beyond $\Gamma$ and hence the state-space $\mathcal{S}$ only includes states $x$ with $\tau_{j,x} \leq \Gamma$, $\forall j \in \Omega$, and so is finite. We immediately see that the problem size is an $m$th order polynomial in $\Gamma$ and hence solving for the optimal value function and policy using exact dynamic programming (DP) methods is rendered intractable for practical values of $\Gamma$ and $m$. Hence, we employ the restricted LP approach developed earlier to compute approximate value functions, from which we compute the corresponding greedy sub-optimal policy. In the next section, we exploit the structure in the perimeter patrol problem to simplify the RLP and NLP formulations and show that both collapse to exact LPs corresponding to MDPs defined on the $M$ partitions.

Figure 2 Value of information gained vs. dwell time.

4.2. Structure associated with the Perimeter Patrol Problem 

In the perimeter patrol problem considered herein, we see that, by definition (31), the reward function $R_u(x)$ is bounded. Consequently the optimal value function is bounded. To explain the inherent structure in the reward, consider a station where an alert is being serviced by a UAV. The information gained by the UAV about the alert is only a function of the service delay at the station and the amount of time the UAV dwells at the station servicing the alert. There is a natural partitioning of states: no matter what the delays are at the other stations, the reward is the same, as long as the maximum delay and the dwell time of the UAV at the station are the same. So, we aggregate all the states which have the same values for $\ell$, $\omega$, $d$, $A_j$, $\forall j \in \Omega$, and $\bar{\tau} = \max_{j \in \Omega} \tau_j$ into one partition. As a result of aggregation, the number of partitions can be shown to be

$$M = 2 \times N + 2 \times N \times (2^m - 1) \times \Gamma + m \times D + m \times D \times (2^{m-1} - 1) \times \Gamma, \tag{33}$$

which is linear in $\Gamma$ and hence considerably smaller than the total number of states (32).
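
To make the size reduction concrete, the two counts can be evaluated directly. This is a minimal sketch with hypothetical function names; the state-count formula is taken in the form that reproduces the figure quoted for the numerical example ($N = 15$, $m = 4$, $\Gamma = 15$, $D = 5$).

```python
def num_states(n_nodes: int, m: int, gamma: int, max_dwell: int) -> int:
    """Total number of MDP states, as in (32)."""
    return (2 * n_nodes * (gamma + 1) ** m
            + max_dwell * m * (gamma + 1) ** (m - 1))


def num_partitions(n_nodes: int, m: int, gamma: int, max_dwell: int) -> int:
    """Number of partitions after aggregation, as in (33)."""
    return (2 * n_nodes
            + 2 * n_nodes * (2 ** m - 1) * gamma
            + m * max_dwell
            + m * max_dwell * (2 ** (m - 1) - 1) * gamma)


if __name__ == "__main__":
    print(num_states(15, 4, 15, 5))      # 2048000
    print(num_partitions(15, 4, 15, 5))  # 8900
```

For the example parameters the aggregation shrinks the problem by a factor of roughly 230, and the gap widens quickly as $\Gamma$ or $m$ grows because the state count is polynomial of order $m$ in $\Gamma$ while the partition count is linear in $\Gamma$.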

We introduce the following notation, that will be used hereafter: Let $\ell_x$, $d_x$, $\omega_x$, $\tau_{j,x}$ and $A_{j,x}$ represent respectively the location, dwell, direction of the UAV's motion and the service delay and alert status at station location $j \in \Omega$ corresponding to some state $x \in \{1, \ldots, |\mathcal{S}|\}$. Also, we will use $\ell(i)$, $d(i)$, $\omega(i)$, $\bar{\tau}(i)$ and $A_j(i)$ to denote the location, dwell, direction, maximum delay, and the alert status at station location $j\,(\in \Omega)$, that correspond to some partition index $i \in \{1, \ldots, M\}$. We will also denote by $x(t; x_0, u_t, y_t)$ the state at time $t > 0$, if the initial state at $t = 0$ is $x_0$ and the sequences of inputs and disturbances are $u_t = \{u(0), u(1), \ldots, u(t-1)\}$ and $y_t = \{y(0), y(1), \ldots, y(t-1)\}$ respectively. We also introduce a partial ordering of the states according to: $x \geq y$ iff $\ell_x = \ell_y$, $d_x = d_y$, $\omega_x = \omega_y$ and $\tau_{j,x} \geq \tau_{j,y}$, $\forall j \in \Omega$. By the same token, we also partially order partitions: $\mathcal{S}_i \geq \mathcal{S}_j$ iff for every $z \in \mathcal{S}_j$, there exists an $x \in \mathcal{S}_i$ such that $x \geq z$. Recall that $\mathcal{T}(i,u)$ is the set of all distinct $(m+1)$ tuples of partition indices that the system can transition to from partition $\mathcal{S}_i$ under control action $u$. For the sake of notational simplicity, we denote the $l$th component of any tuple $k \in \mathcal{T}(i,u)$ by $k_{l-1}$ and the cardinality of the set $\mathcal{T}(i,u)$ by $|\mathcal{T}(i,u)|$. Also we define the partitions to be of two types: a partition $\mathcal{S}_i$ is of type 1 and we write $i \in \mathcal{P}_1$ if $\ell(i) \in \Omega$, $d(i) = 0$, $A_{\ell(i)}(i) = 1$, and $A_j(i) = 1$ for some $j \in \Omega$, $j \neq \ell(i)$, i.e., the UAV is at a station with an alert, the dwell time is zero and also there is an alert at some other station. Else it is of type 2 and we write $i \in \mathcal{P}_2$. Given this definition, we have the following important result, that we will make use of in the remainder of the paper.
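
The partial order, the aggregation key and the type-1 test can be sketched as follows. This is a hedged illustration rather than the authors' code: the names `PatrolState`, `dominates`, `partition_key` and `is_type1` are hypothetical, and the alert status of a station is assumed here to be 1 exactly when its service delay is positive.

```python
from typing import NamedTuple, Sequence, Tuple


class PatrolState(NamedTuple):
    loc: int                 # UAV position
    omega: int               # direction of travel (+1 or -1)
    dwell: int               # loiters completed
    delays: Tuple[int, ...]  # service delay per station, ordered as `stations`


def dominates(x: PatrolState, y: PatrolState) -> bool:
    """x >= y: same location, dwell and direction, and component-wise larger delays."""
    return (x.loc == y.loc and x.dwell == y.dwell and x.omega == y.omega
            and all(tx >= ty for tx, ty in zip(x.delays, y.delays)))


def partition_key(x: PatrolState) -> tuple:
    """States sharing this key are aggregated into one partition."""
    alerts = tuple(1 if t > 0 else 0 for t in x.delays)  # assumed alert status
    return (x.loc, x.omega, x.dwell, alerts, max(x.delays))


def is_type1(x: PatrolState, stations: Sequence[int]) -> bool:
    """UAV at a station with an alert, zero dwell, and an alert at another station."""
    if x.loc not in stations or x.dwell != 0:
        return False
    here = list(stations).index(x.loc)
    alerts = [t > 0 for t in x.delays]
    return alerts[here] and any(a for j, a in enumerate(alerts) if j != here)
```

Two states with delays `(3, 0, 2, 0)` and `(3, 0, 1, 0)` at the same location share a partition, since only the alert pattern and the maximum delay enter the key.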

Lemma 4. The cardinality of $\mathcal{T}(i,u)$ is given by:

$$|\mathcal{T}(i,u)| = \begin{cases} \bar{\tau}(i), & \text{if } i \in \mathcal{P}_1 \text{ and } u = 0, \\ 1, & \text{otherwise.} \end{cases}$$

Proof of Lemma 4. First we consider partition index $i$ of type 1 and control input $u = 0$. Since the UAV has decided to loiter at the current station, i.e., $\ell(i) \in \Omega$, the service delay at that station, $\tau_{\ell(i)}$, will be reset to zero in the next time step. Hence the future state (and partition) maximum delay will be determined by the highest of the service delays, say $\hat{\tau}$, among the other stations with alerts (at least one such station exists since partition $i$ is of type 1). So $\forall j \in \{1, \ldots, \bar{\tau}(i)\}$, $\exists\, x_j \in \mathcal{S}_i$ such that $\hat{\tau}_{x_j} = j$. The corresponding tuple of future partition indices $z^{x_j} = (f(x_j,u,Y_0), f(x_j,u,Y_1), \ldots, f(x_j,u,Y_m))$ will have maximum delay $j+1$ and so $\mathcal{T}(i,0) = \bigcup_{j=1}^{\bar{\tau}(i)} \{z^{x_j}\} \Rightarrow |\mathcal{T}(i,0)| = \bar{\tau}(i)$. For all other control choices, $u \neq 0$, all the states $x \in \mathcal{S}_i$ will transition to future states with the same maximum delay $\bar{\tau}(i) + 1$. So, for $u \neq 0$, $\mathcal{T}(i,u)$ is a singleton set and hence $|\mathcal{T}(i,u)| = 1$. For partition indices $j$ of type 2 with $\bar{\tau}(j) > 0$, all the states $x \in \mathcal{S}_j$ will transition to future states with the same maximum delay $\bar{\tau}(j) + 1$ and so $|\mathcal{T}(j,u)| = 1$, $\forall u$. If $\bar{\tau}(j) = 0$, then the partition $\mathcal{S}_j$ is a singleton set as per the aggregation scheme (see Sec. 4.2) and hence $|\mathcal{T}(j,u)| = 1$, $\forall u$. □

Theorem 3. For the perimeter patrol problem, the NLP (26) reduces to the following LP.

$$LBLP := \min c^T w, \text{ subject to}$$
$$w(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, w(k_l), \quad \forall u, \; i = 1, \ldots, M, \tag{34}$$

where the tuple $k \in \mathcal{T}(i,u)$ if $|\mathcal{T}(i,u)| = 1$, else $k = k^*$, where $k^* \in \mathcal{T}(i,u)$ is the tuple of partition indices such that $\bar{\tau}(k^*_l) = \bar{\tau}(i) + 1$, $l = 0, \ldots, m$. Furthermore, the optimal solution $w^*$ is dominated by every feasible $w$ for the NLP and, in particular, it is a lower bound to the optimal value function, i.e., for all $i = 1, \ldots, M$, one has $w^*(i) \leq \min_{x \in \mathcal{S}_i} V^*(x)$.

Before proceeding further, we make two key claims that are essential for the proof of Theorem 3. The justification for the claims has been provided in the Appendix.

Claim 1. If $x_1 \geq x_2$, then for the same sequence of inputs $u_t$ and disturbances $y_t$, the system state evolves in such a way that $x(t; x_1, u_t, y_t) \geq x(t; x_2, u_t, y_t)$ for every $t \geq 0$.

Claim 2. If $x_1 \geq x_2$, then $V^*(x_1) \leq V^*(x_2)$. Furthermore, if $\mathcal{S}_i \geq \mathcal{S}_j$, then $\min_{x \in \mathcal{S}_i} V^*(x) \leq \min_{z \in \mathcal{S}_j} V^*(z)$.

Proof of Theorem 3. Recall the non-linear constraints (25) satisfied by $\tilde{w}(i) := \min_{x \in \mathcal{S}_i} V^*(x)$, that motivated the NLP formulation:

$$w(i) \geq \min_{x \in \mathcal{S}_i} \Big\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, w(f(x,u,Y_l)) \Big\}, \quad \forall u, \; i = 1, \ldots, M, \tag{35}$$

which, given the definition of $\mathcal{T}(i,u)$, can be written in the following equivalent form:

$$w(i) \geq r_u(i) + \lambda \min_{k \in \mathcal{T}(i,u)} \sum_{l=0}^{m} p_l\, w(k_l), \quad \forall u, \; i = 1, \ldots, M, \tag{36}$$

where $r_u(i)$ is the reward associated with partition index $i$ and, given the partitioning scheme, satisfies $R_u(x) = r_u(i)$, $\forall x \in \mathcal{S}_i$. Given the structure in the perimeter patrol problem, we will show that (36) collapses to a single linear inequality constraint for every partition index $i$ and control $u$. Let us focus our attention on partition index $i$ of type 1 and control action $u = 0$. For this choice, the cardinality of $\mathcal{T}(i,0)$ is $\bar{\tau}(i)$ as per Lemma 4. Indeed, $\exists\, \bar{x} \in \mathcal{S}_i$ such that the corresponding tuple of future partition indices $k^* = (f(\bar{x},0,Y_0), f(\bar{x},0,Y_1), \ldots, f(\bar{x},0,Y_m))$ has the highest possible maximum delay, i.e., $\bar{\tau}(k^*_l) = \bar{\tau}(i) + 1$, $l = 0, \ldots, m$. Since $k^*_l \geq k_l$, $l = 0, \ldots, m$, $\forall k \in \mathcal{T}(i,u)$, we have from Claim 2 that $w(k^*_l) \leq w(k_l)$, $l = 0, \ldots, m$, $\forall k \in \mathcal{T}(i,u)$. So, the non-linear inequality corresponding to partition index $i \in \mathcal{P}_1$ and control $u = 0$ becomes:

$$w(i) \geq r_0(i) + \lambda \sum_{l=0}^{m} p_l\, w(k^*_l). \tag{37}$$

If $u \neq 0$, then $|\mathcal{T}(i,u)| = 1$. So there exists exactly one tuple $k$ in $\mathcal{T}(i,u)$ and hence, the non-linear constraint (36) reduces to the linear inequality:

$$w(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, w(k_l). \tag{38}$$

For partition indices $j$ of type 2, $|\mathcal{T}(j,u)| = 1$, $\forall u$. So, as before, the non-linear inequality (36) collapses to the linear inequality (38).

In summary, we have the following: regardless of which partition one considers, the corresponding 
non-linear constraint in NLP collapses to a linear constraint and hence, NLP for the perimeter 
patrol problem collapses to the following LP: 

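$$LBLP := \min c^T w, \text{ subject to}$$
$$w(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, w(k_l), \quad \forall u, \; i = 1, \ldots, M, \tag{39}$$

where the tuple $k \in \mathcal{T}(i,u)$ if $|\mathcal{T}(i,u)| = 1$, else $k = k^*$, where $k^* \in \mathcal{T}(i,u)$ is the tuple of partition indices such that $\bar{\tau}(k^*_l) = \bar{\tau}(i) + 1$, $l = 0, \ldots, m$.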

To prove the second part of the Theorem, we observe that LBLP defined above is the exact LP corresponding to a reduced order MDP defined on the $M$ partitions. Hence, we readily have from Lemmas 1 and 2 that the optimal solution $w^*$ lower bounds every feasible solution including $\tilde{w}$ and hence, $w^*(i) \leq \tilde{w}(i) = \min_{x \in \mathcal{S}_i} V^*(x) \leq V^*(y)$, $\forall y \in \mathcal{S}_i$, $i = 1, \ldots, M$. □

So, for the perimeter patrol problem, one can compute a lower bound for the optimal value function efficiently by solving LBLP. The next logical question is whether the upper bound formulation, RLP (13), also simplifies, given the structure in the problem. It turns out that this is indeed the case, as can be seen from the following theorem.

Theorem 4. For the perimeter patrol problem, the RLP (13) reduces to the following LP.

$$UBLP := \min c^T w, \text{ subject to}$$
$$w(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, w(k_l), \quad \forall u, \; i = 1, \ldots, M, \tag{40}$$

where the tuple $k \in \mathcal{T}(i,u)$ if $|\mathcal{T}(i,u)| = 1$, else $k = k^*$, where $k^* \in \mathcal{T}(i,u)$ is the tuple of partition indices such that $\bar{\tau}(k^*_l) = 2$, $l = 0, \ldots, m$.

Proof of Theorem 4. Given the partitioning scheme, one can rewrite the Bellman inequalities as follows: for each $i = 1, \ldots, M$,

$$V^*(x) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, V^*(f(x,u,Y_l)), \quad \forall u, \; \forall x \in \mathcal{S}_i. \tag{41}$$

With the restriction that $V(x) = v(i)$, $\forall x \in \mathcal{S}_i$, we get the following constraint for RLP (13).

$$v(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, v(k_l), \quad \forall k \in \mathcal{T}(i,u), \; \forall u, \; i = 1, \ldots, M. \tag{42}$$

For partition index $i \in \mathcal{P}_1$, $\exists\, \underline{x} \in \mathcal{S}_i$ that transitions to future states with the least possible maximum delay, 2. Hence $f(\underline{x},0,Y_l) \leq f(x,0,Y_l)$, $l = 0, \ldots, m$, $\forall x \in \mathcal{S}_i$ and so from Claim 2 we have $V^*(f(\underline{x},0,Y_l)) \geq V^*(f(x,0,Y_l))$, $l = 0, \ldots, m$, $\forall x \in \mathcal{S}_i$. So, for $i \in \mathcal{P}_1$ and $u = 0$, the inequalities (41) can be written as follows,

$$V^*(x) \geq r_0(i) + \lambda \sum_{l=0}^{m} p_l\, V^*(f(\underline{x},0,Y_l)), \quad \forall x \in \mathcal{S}_i. \tag{43}$$

The above implies that the $\bar{\tau}(i)$ constraints (42) in RLP can be replaced by the single constraint,

$$v(i) \geq r_0(i) + \lambda \sum_{l=0}^{m} p_l\, v(k^*_l), \tag{44}$$

where $k^* = (f(\underline{x},0,Y_0), f(\underline{x},0,Y_1), \ldots, f(\underline{x},0,Y_m))$ is the tuple of future partition indices (corresponding to $\underline{x}$) with the least possible maximum delay, i.e., $\bar{\tau}(k^*_l) = 2$, $l = 0, \ldots, m$. For the other control choices, $u \neq 0$, there exists only one tuple $k$ in $\mathcal{T}(i,u)$ (since $|\mathcal{T}(i,u)| = 1$) and hence the constraint (42) is the single constraint,

$$v(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, v(k_l), \quad u \neq 0. \tag{45}$$

Similarly, for partitions $\mathcal{S}_j$ of type 2, $|\mathcal{T}(j,u)| = 1$, $\forall u$, and so the constraint (42) is the single constraint (45).

In summary, we have the following: regardless of which partition index $i \in \{1, \ldots, M\}$ and control action $u$ are considered, the corresponding $|\mathcal{T}(i,u)|$ linear constraints in RLP collapse to a single constraint and hence, RLP for the perimeter patrol problem reduces to the following exact LP:

$$UBLP := \min c^T w, \text{ subject to}$$
$$w(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, w(k_l), \quad \forall u, \; i = 1, \ldots, M, \tag{46}$$

where the tuple $k \in \mathcal{T}(i,u)$ if $|\mathcal{T}(i,u)| = 1$, else $k = k^*$, where $k^* \in \mathcal{T}(i,u)$ is the tuple of partition indices such that $\bar{\tau}(k^*_l) = 2$, $l = 0, \ldots, m$. □

In conclusion, we have two complementary LP formulations, UBLP and LBLP, that can be used to efficiently compute upper bound and lower bound approximate value functions respectively, for the perimeter alert patrol problem. Note that the two formulations involve computing the optimal value functions for reduced order MDPs defined over the $M$ partitions and in that sense are computationally attractive (compared to solving the original problem) since $M \ll |\mathcal{S}|$. In the following section, we will provide numerical results that corroborate the key claims made earlier regarding the structure in the perimeter alert patrol problem.
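
Because each bounding LP is the exact LP of a reduced-order MDP, its optimal solution coincides with the optimal value function of that MDP. The sketch below uses value iteration in place of an LP solver (a swapped-in technique, chosen here only to keep the example dependency-free) on a hypothetical two-partition MDP, and then checks that the result satisfies the LP constraints $w(i) \geq r_u(i) + \lambda \sum_j P_u(i,j)\, w(j)$. All names and the toy data are assumptions made for illustration.

```python
def backup(w, rewards, transitions, lam, u, i):
    """Right-hand side r_u(i) + lam * sum_j P_u(i, j) * w(j)."""
    return rewards[u][i] + lam * sum(p * w[j] for j, p in transitions[u][i].items())


def value_iteration(rewards, transitions, lam, tol=1e-10, max_iter=100000):
    """Optimal value function of a small discounted-payoff MDP.

    rewards[u][i] is the reward of control u in partition i;
    transitions[u][i] maps successor partition j to its probability.
    """
    n = len(next(iter(rewards.values())))
    w = [0.0] * n
    for _ in range(max_iter):
        w_new = [max(backup(w, rewards, transitions, lam, u, i) for u in rewards)
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(w, w_new)) < tol:
            return w_new
        w = w_new
    return w


def is_lp_feasible(w, rewards, transitions, lam, slack=1e-8):
    """Check w(i) >= r_u(i) + lam * sum_j P_u(i, j) w(j) for every u and i."""
    return all(w[i] + slack >= backup(w, rewards, transitions, lam, u, i)
               for u in rewards for i in range(len(w)))


# Hypothetical two-partition example: control 0 stays put, control 1 switches.
toy_rewards = {0: [1.0, 2.0], 1: [0.0, 0.0]}
toy_transitions = {0: [{0: 1.0}, {1: 1.0}], 1: [{1: 1.0}, {0: 1.0}]}
toy_w = value_iteration(toy_rewards, toy_transitions, lam=0.9)
```

In this toy example the value function is approximately `[18.0, 20.0]`: from the first partition it pays to switch and then stay in the second one. For the patrol problem the same routine would be applied to the $M$ partitions with the tuples $k$ selected as in (39) or (46).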

5. Numerical Results 

We consider a perimeter with $N = 15$ nodes of which node numbers $\{0, 3, 7, 11\}$ are alert stations and a maximum allowed dwell of $D = 5$ orbits. The other parameters were chosen to be: weighing factor, $\rho = 0.005$ and temporal discount factor, $\lambda = 0.9$. Based on experience, we chose a rather low alert arrival rate $\alpha$, for which we expect 2 alerts to occur on average in the time taken by the UAV to complete an uninterrupted patrol around the perimeter. We set the maximum delay time that we keep track of to be $\Gamma = 15$, for which the total number of states comes out to be $|\mathcal{S}| = 2{,}048{,}000$. Before venturing into the simulation, we first provide numerical results that corroborate the key Claim 2 made earlier in the paper. For this, we solve for the optimal value function $V^*$. This is possible since the size of the example problem considered in this section is small and hence an exact solution can be obtained. In Figure 3, we show results supporting the claim that for partially ordered states $x_1 \geq x_2$, the corresponding optimal value functions satisfy $V^*(x_1) \leq V^*(x_2)$. For this, we plot the optimal value function $V^*$ corresponding to states with alert status $A_j = 1$, $\forall j \in \Omega$ (all stations have alerts), dwell $d = 0$, direction $\omega = 1$ and the UAV located at one of the four station locations $\ell \in \Omega$. The partially ordered states represented in the X-axis are non-decreasing from left to right with maximum delay $\bar{\tau}$ varying from 2 to $\Gamma$. The dotted grid lines in the plot separate the different partitions that the states fall into. In Figure 4, we show results supporting the claim that for partially ordered partitions $\mathcal{S}_i \geq \mathcal{S}_j$, the corresponding optimal value functions satisfy $\min_{x \in \mathcal{S}_i} V^*(x) \leq \min_{y \in \mathcal{S}_j} V^*(y)$. For this, we plot the value functions corresponding to states with alert status $A = 1001$ (station locations 0 and 11 have alerts), dwell $d = 0$, direction $\omega = 1$ and $\ell = 0$. The partially ordered partitions demarcated by the dotted grid lines in the X-axis are non-decreasing from left to right with maximum delay $\bar{\tau}$ varying from 2 to $\Gamma$. Within each partition, we plot the value function associated with every state in the partition and also the least value function in the partition, shown as the green line. One can easily see that the claim above is satisfied.

Figure 3 Monotonically decreasing value function corresponding to partially ordered states with increasing maximum delay.

Figure 4 Monotonically decreasing least value function corresponding to partially ordered partitions with increasing maximum delay. (Legend: optimal value function; least value in partition.)

In the next section, we shall consider the same example problem and show that the proposed 
approximate methodology is effective. For this, we compute the approximate value functions via 
the restricted LP formulation and compare them with the optimal value function. In addition, we 
also compute the greedy sub-optimal policy corresponding to the approximate value function and 
compare it with the optimal policy in terms of the two performance metrics: alert service delay 
and information gained upon loitering. 

5.1. Simulation Results 

We aggregate the states in the example problem based on the reward function (see Section 4.2 for details). This results in $M = 8900$ partitions, which is considerably smaller than the original number of states, $|\mathcal{S}|$. We solve both the UBLP and LBLP formulations which give us the upper and lower bounds, $v^*$ and $w^*$ respectively, to the optimal value function $V^*$. Since we have the optimal value function for the example problem, we use it for comparison with the approximations. Note that for higher values of $m$ and $\Gamma$, the problem would essentially become intractable and one would not have access to the optimal value function. Nevertheless, one can compute $v^*$ and $w^*$ and the difference between the two would give an estimate of the quality of the approximation. We give a representative sample of the approximation results by choosing all the states in partitions corresponding to alert status $A_j = 1$, $\forall j \in \Omega$ (all stations have alerts) and maximum delay $\bar{\tau} = 2$. Figure 5 compares the optimal value function $V^*$ with the upper and lower bound approximate value functions, $V_{up} = \Phi v^*$ and $V_{low} = \Phi w^*$, for this subset of the state-space. The first 15 partitions shown in the X-axis of Figure 5, i.e., partition numbers $i = 1, \ldots, 15$, correspond to the clockwise states:

$$\ell = i - 1, \quad d = 0, \quad \omega = 1, \quad \bar{\tau} = \max_j \tau_j = 2, \quad A_j = 1, \; \forall j \in \Omega, \tag{47}$$

and the last 15 partitions shown in the X-axis, i.e., partition numbers $i = 16, \ldots, 30$, correspond to the counter-clockwise states:

$$\ell = i - N - 1, \quad d = 0, \quad \omega = -1, \quad \bar{\tau} = \max_j \tau_j = 2, \quad A_j = 1, \; \forall j \in \Omega. \tag{48}$$

Figure 5 Comparison of approximate value functions with the optimal. (X-axis: partition number.)

Interestingly, we notice immediately that the lower bound appears to be tighter than the upper bound. Recall that our objective is to obtain a good sub-optimal policy and so, we consider the policy that is greedy with respect to $V_{low}$:

$$\pi_s(x) = \arg\max_u \Big\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, V_{low}(f(x,u,Y_l)) \Big\}, \quad \forall x \in \{1, \ldots, |\mathcal{S}|\}. \tag{49}$$

To assess the quality of the sub-optimal policy, we also compute the expected discounted payoff, $V_{sub}$, that corresponds to the sub-optimal policy $\pi_s$, by solving the system of equations:

$$(I - \lambda P_{\pi_s}) V_{sub} = R_{\pi_s}. \tag{50}$$
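
The two steps above, extracting a greedy policy from an approximate value function as in (49) and evaluating it as in (50), can be sketched on a small generic MDP. This is an illustration under stated assumptions: the data layout and function names are hypothetical, and the linear system (50) is solved here by fixed-point iteration $V \leftarrow R_\pi + \lambda P_\pi V$ rather than by a direct solve, which converges to the same solution because $\lambda < 1$.

```python
def greedy_policy(rewards, transitions, lam, v):
    """For each state pick the control maximizing r_u(x) + lam * E[v(next state)]."""
    policy = []
    for x in range(len(v)):
        best_u, best_q = None, float("-inf")
        for u in rewards:
            q = rewards[u][x] + lam * sum(p * v[j] for j, p in transitions[u][x].items())
            if q > best_q:
                best_u, best_q = u, q
        policy.append(best_u)
    return policy


def evaluate_policy(rewards, transitions, lam, policy, tol=1e-10, max_iter=100000):
    """Discounted payoff of a fixed policy via fixed-point iteration."""
    v = [0.0] * len(policy)
    for _ in range(max_iter):
        v_new = [rewards[u][x] + lam * sum(p * v[j] for j, p in transitions[u][x].items())
                 for x, u in enumerate(policy)]
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    return v


# Hypothetical two-state example: control 0 stays put, control 1 switches state.
toy_rewards = {0: [1.0, 2.0], 1: [0.0, 0.0]}
toy_transitions = {0: [{0: 1.0}, {1: 1.0}], 1: [{1: 1.0}, {0: 1.0}]}
toy_v_low = [0.0, 0.0]  # a (crude) lower bound on the optimal value function
toy_policy = greedy_policy(toy_rewards, toy_transitions, 0.9, toy_v_low)
toy_v_sub = evaluate_policy(toy_rewards, toy_transitions, 0.9, toy_policy)
```

With the crude lower bound used here the greedy policy stays put in both states, giving a payoff of about `[10.0, 20.0]`, which sits between the lower bound and the optimal value function `[18.0, 20.0]` of this toy problem.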

Since $V_{sub}$ corresponds to a sub-optimal policy and in view of the monotonicity property of the Bellman operator, the following inequalities hold:

$$V_{low} \leq V_{sub} \leq V^* \leq V_{up}.$$

In Figure 6, we compare $V_{sub}$ with the optimal value function $V^*$ for the clockwise states defined in (47) and note that the approximation is quite good. Finally, we compare the performance of the sub-optimal policy $\pi_s$ with that of the optimal strategy $\pi^*$ in terms of the two important metrics: service delay and information gain (measured via the dwell time). To collect the performance statistics, we ran Monte Carlo simulations with alerts generated from a Poisson arrival stream with rate $\alpha$ over a 60000 time unit simulation window. Both the optimal and sub-optimal policies were tested against the same alert sequence. Figure 7 shows histogram plots for the service delay (top plot) and the dwell time (bottom plot) for all serviced alerts in the simulation run. The corresponding mean and worst case service delays and the mean dwell time are also shown in Table 1. We see that there is hardly any difference in terms of either metric between the optimal and the sub-optimal policies. This substantiates the claim that the aggregation approach gives us a sub-optimal policy that performs almost as well as the optimal policy itself. This is to be expected, given that the value functions corresponding to the optimal and sub-optimal policies are close to each other (see Figure 6). Since the false alarm rate $\alpha$ is fairly low, we see from the bottom plot of Figure 7 that roughly 90% of the alerts were cleared within ten time steps. Also from the top plot of Figure 7 we see that maximum information was gained (5 loiters completed) on almost 90% of the serviced alerts.

Figure 6 Comparison of value function corresponding to suboptimal policy $\pi_s$ with the optimal.

Figure 7 Comparison of service delay and number of loiters between optimal and sub-optimal policies.

Table 1 Comparison of alert servicing performance between optimal and sub-optimal policies.

6. Conclusions 

We have provided a state aggregation based restricted LP method to construct sub-optimal policies for stochastic DPs along with a bound for the deviation of such a policy from the optimum value function. As a key result, we have shown that the solution to the aggregation based LP is independent of the underlying cost function and we do so by demonstrating that the restricted LP is, in fact, the exact LP that corresponds to a lower dimensional MDP defined over the partitions. We also provide a novel non-linear program that can be used to compute a non-trivial lower bound to the optimal value function. In particular, for the perimeter patrol stochastic control problem, we have shown that both the upper and lower bound formulations simplify to exact LPs corresponding to some reduced order MDPs. To do so, we have exploited the partial ordering of the states that comes about because of the structure inherent in the reward function. It would be interesting to see if the simplification can be achieved for other problems that exhibit a similar structure. For the perimeter patrol problem, numerical results obtained via Monte Carlo simulations show that the sub-optimal policy obtained via the approximate value functions performs almost as well as the optimal policy. The literature suggests that, in general, the solution to a restricted LP depends on the underlying cost function when the value function is parameterized by arbitrary basis functions. We have shown that, for the special case of hard aggregation, this is not true. Surely, there exist other basis functions with the same property and it would be useful to uncover the class of basis functions for which the independence result holds.

Appendix to "Bounding Procedures for Stochastic Dynamic Programs with Application to the Perimeter Alert Patrol Problem" by Park et al.

This appendix contains supplementary material to the paper and also lengthy proofs that were 
left out of the main document. 
EC.1. Operator Error Model

We treat the operator as a sensor-in-the-loop automaton. The operator is not infallible and we account for that statistically in the optimization. To quantify the operator's performance, we consider two random variables: the variable $X$ that specifies whether the alert is a real threat (target $T$) or a nuisance (false target $FT$) and the operator decision $Z$ which specifies whether the operator determines the alert to be a real threat $Z_1$ or a nuisance $Z_2$. We stipulate that the a priori probability that an alert is a real target,

$$\text{Prob}\{X = T\} = p \ll 1. \tag{EC.1}$$

We assume, based on experience, that $p = 0.01$ in this work. The conditional probabilities which specify whether the operator correctly reported a threat and a nuisance are assumed to be functions of the dwell time, $d$:

$$\begin{aligned}
P_{TR}(d) &:= \text{Prob}\{Z = Z_1 \mid X = T\} = a + b\,(1 - e^{-\mu_1 d}), \\
P_{FTR}(d) &:= \text{Prob}\{Z = Z_2 \mid X = FT\} = c + g\,(1 - e^{-\mu_2 d}),
\end{aligned} \tag{EC.2}$$

where the acronyms TR and FTR stand for Target Report and False Target Report respectively. The parameters $a, b, \mu_1, c, g, \mu_2$ characterize the "confusion matrix" and the performance of the operator as a sensor; for details on sensor performance modeling, see Sec. 7.2 in Kish et al. (2009). The parameters satisfy the constraints:

$$0 < a + b < 1, \quad 0 < c + g < 1, \quad \mu_1 > 0 \quad \text{and} \quad \mu_2 > 0.$$

In this work, we chose $a = c = 0.5$, $b = g = 0.45$ and $\mu_1 = \mu_2 = 1$. The choice $a = c = 0.5$ corresponds to an uninformed or unbiased operator, i.e., the operator cannot tell if the alert is a threat or a nuisance without having seen any video footage of the alert site. We wish to maximize the mutual information, derived along the lines of information theory (Cover and Thomas 2006), between the random variables $X$ and $Z$ given by:

$$I(X; Z) = H(X) - H(X \mid Z)$$

where $H(X)$ is the entropy of $X$ and $H(X \mid Z)$ is the conditional entropy of $X$ given $Z$. Using Bayes' rule and the probabilities (EC.1) and (EC.2), one can show that the mutual information is a function of dwell time, $d$:

$$\begin{aligned}
I(d) = {} & p P_{TR} \log \frac{P_{TR}}{p P_{TR} + (1-p)(1-P_{FTR})} + p\,(1-P_{TR}) \log \frac{1-P_{TR}}{p\,(1-P_{TR}) + (1-p) P_{FTR}} \\
& + (1-p)(1-P_{FTR}) \log \frac{1-P_{FTR}}{p P_{TR} + (1-p)(1-P_{FTR})} + (1-p) P_{FTR} \log \frac{P_{FTR}}{p\,(1-P_{TR}) + (1-p) P_{FTR}}
\end{aligned} \tag{EC.4}$$

since the conditional probabilities, $P_{TR}$ and $P_{FTR}$, are both functions of $d$ (EC.2).

EC.2. Proof of Lemma in Section 2.1

Lemma 1. Let the vector $V$ satisfy the following set of inequalities:

$$[I - \lambda^T P_\pi^T]\, V \geq [I + \lambda P_\pi + \cdots + \lambda^{T-1} P_\pi^{T-1}]\, R_\pi, \quad \forall \pi. \tag{EC.5}$$

Then, we have $V \geq V^*$.

Proof of Lemma 1. For every stationary policy $\pi$, we have:

$$[I - \lambda^T P_\pi^T]\, V \geq [I + \lambda P_\pi + \cdots + \lambda^{T-1} P_\pi^{T-1}]\, R_\pi. \tag{EC.6}$$

Since $P_\pi^T$ is a stochastic matrix (i.e., it is non-negative and its row sum equals 1), and $\lambda \in [0,1)$, the matrix $[I - \lambda^T P_\pi^T]^{-1}$ admits the following analytic series expansion:

$$[I - \lambda^T P_\pi^T]^{-1} = I + \lambda^T P_\pi^T + \lambda^{2T} P_\pi^{2T} + \cdots.$$

So, all the entries of $[I - \lambda^T P_\pi^T]^{-1}$ are non-negative and hence (EC.6) implies the following (although the converse is not true!):

$$V \geq [I - \lambda^T P_\pi^T]^{-1} [I + \lambda P_\pi + \cdots + \lambda^{T-1} P_\pi^{T-1}]\, R_\pi = \sum_{t=0}^{\infty} \lambda^t P_\pi^t R_\pi, \quad \forall \pi. \tag{EC.7}$$

So, $V$ dominates the expected payoff associated with every policy $\pi$, including the optimal policy $\pi^*$. Hence $V \geq V^*$. □
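
The series identity behind (EC.7) can be checked numerically: the infinite-horizon discounted payoff equals the payoff over one block of steps plus the discounted block-step continuation of itself. The sketch below is a small pure-Python check on a hypothetical two-state chain; the names, the example matrix and the truncation length are assumptions made for illustration.

```python
def mat_vec(P, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def discounted_payoff(P, R, lam, horizon):
    """sum_{t=0}^{horizon-1} lam^t P^t R."""
    total = [0.0] * len(R)
    term = list(R)
    for _ in range(horizon):
        total = [a + b for a, b in zip(total, term)]
        term = [lam * x for x in mat_vec(P, term)]
    return total


def discounted_power(P, v, lam, block):
    """lam^block P^block v."""
    for _ in range(block):
        v = [lam * x for x in mat_vec(P, v)]
    return v


def series_identity_residual(P, R, lam, block, long_horizon=2000):
    """Max-norm of V - [I + lam P + ... ] R - lam^block P^block V, with V the long-run payoff."""
    v_inf = discounted_payoff(P, R, lam, long_horizon)   # truncated infinite sum
    v_block = discounted_payoff(P, R, lam, block)
    tail = discounted_power(P, v_inf, lam, block)
    return max(abs(a - b - c) for a, b, c in zip(v_inf, v_block, tail))
```

For the stochastic matrix `[[0.5, 0.5], [0.2, 0.8]]`, rewards `[1.0, 2.0]`, discount 0.9 and a block of 3 steps, the residual is at the level of floating-point round-off.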

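EC.3. Proof of Lemma in Section 3.1

Lemma 3. Consider a surrogate LP for the RLP through a set of dual variables, $\mu$, given by:

$$SLP(\mu) := \min c^T v, \text{ subject to} \tag{EC.8}$$
$$\sum_{x \in \mathcal{S}_i} \mu_u^i(x)\, v(i) \geq \sum_{x \in \mathcal{S}_i} \mu_u^i(x) \Big[ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v(f(x,u,Y_l)) \Big], \quad \forall u, \; i = 1, \ldots, M.$$

Then, $\exists\, \bar{\mu} \geq 0$ such that $SLP(\bar{\mu}) = RLP$, and, for every partition index $i = 1, \ldots, M$, $\exists\, u$ such that $\sum_{x \in \mathcal{S}_i} \bar{\mu}_u^i(x) > 0$. Moreover, the optimal solution $v^*$ to RLP is independent of the cost vector $c$ and any other feasible solution $v$ to RLP dominates $v^*$.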

Proof of Lemma 3. Consider the Lagrangian dual problem to RLP,

$$LD(\mu) := \min_v \Big\{ c^T v - \sum_{i=1}^{M} \sum_{u} \sum_{x \in \mathcal{S}_i} \mu_u^i(x) \Big[ v(i) - R_u(x) - \lambda \sum_{l=0}^{m} p_l\, v(f(x,u,Y_l)) \Big] \Big\}.$$

Let $\phi(v, \mu) = c^T v - \sum_{i,u} \sum_{x \in \mathcal{S}_i} \mu_u^i(x) \big[ v(i) - R_u(x) - \lambda \sum_{l=0}^{m} p_l\, v(f(x,u,Y_l)) \big]$. Let $\mathcal{F}$ be the feasible set for RLP and let $\mathcal{F}(\mu)$ be the feasible set of $SLP(\mu)$. Then, we have,

$$LD(\mu) := \min_v \phi(v, \mu) \leq \min_{v \in \mathcal{F}(\mu)} \phi(v, \mu) \leq \min_{v \in \mathcal{F}(\mu)} c^T v = SLP(\mu).$$

Since $\mathcal{F} \subseteq \mathcal{F}(\mu)$ for every $\mu$, it readily follows that $SLP(\mu) \leq RLP$. Also, RLP is feasible. For example, consider the feasible solution $v$ given by,

$$v(i) = \frac{\max_{x,u} R_u(x)}{1 - \lambda}, \quad \forall i \in \{1, \ldots, M\}.$$

Moreover, any feasible $v$ satisfies,

$$v(i) \geq \frac{\min_{x,u} R_u(x)}{1 - \lambda}, \quad \forall i \in \{1, \ldots, M\}.$$

So, RLP is also bounded from below and hence it satisfies the requirements of strong duality for LPs. Hence, there exists a $\bar{\mu}$ which is optimal for the dual of RLP and also satisfies $LD(\bar{\mu}) = RLP$. Therefore, the same $\bar{\mu}$ must also be such that $SLP(\bar{\mu}) = RLP$. Now for every partition index $i = 1, \ldots, M$, there exists at least one $u$ for which $\sum_{x \in \mathcal{S}_i} \bar{\mu}_u^i(x) > 0$. If for some $i$, $\bar{\mu}_u^i(x) = 0$ for every $x \in \mathcal{S}_i$ and for every $u$, then $SLP(\bar{\mu})$ will not have any constraints lower bounding $v(i)$. It will then admit solutions for $v(i)$ that are arbitrarily negative and correspondingly, one can find a direction in which the cost of $SLP(\bar{\mu})$ decreases without bound. However, this is a contradiction, since RLP is lower bounded. So, we can rewrite $SLP(\bar{\mu})$ in the following manner:

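\[
SLP(\bar{\mu}) := \min c^{T} v, \quad \text{subject to} \tag{EC.9}
\]
\[
v(i) \ge r_u(i) + \frac{\lambda}{\sum_{x \in S_i} \bar{\mu}_u^i(x)} \sum_{x \in S_i} \bar{\mu}_u^i(x) \sum_{l=0}^{m} p_l\, v\big(f(x,u,Y_l)\big), \quad \forall u \in U_i, \; i = 1, \ldots, M,
\]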

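where $u \in U_i$ if $\sum_{x \in S_i} \bar{\mu}_u^i(x) > 0$. Clearly, $SLP(\bar{\mu})$ is the exact LP corresponding to an MDP of
reduced dimension with one-step reward function,

\[
r_u(i) = \frac{\sum_{x \in S_i} \bar{\mu}_u^i(x)\, R_u(x)}{\sum_{x \in S_i} \bar{\mu}_u^i(x)}, \quad \forall u \in U_i,
\]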



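and transition probability matrix $\bar{P}_u$ given by,

\[
\bar{P}_u(i,j) := \begin{cases} \dfrac{\sum_{x \in S_i} \bar{\mu}_u^i(x) \sum_{l:\, f(x,u,Y_l) \in S_j} p_l}{\sum_{x \in S_i} \bar{\mu}_u^i(x)}, & \text{if } u \in U_i, \\[2ex] 0, & \text{otherwise.} \end{cases}
\]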
So, by Lemma 2, the optimal solution $v^*$ of $SLP(\bar{\mu})$ is also the optimal value function associated with
this reduced MDP. Also, any feasible $v$ to RLP is a feasible solution to $SLP(\bar{\mu})$, since the
constraints for $SLP(\bar{\mu})$ are obtained by a convex combination of the constraints of RLP. So, it
follows from Lemma [J that $v \ge v^*$.

Finally, let $RLP(c)$ and $RLP(d)$ denote the restricted LPs corresponding to two different cost
vectors $c$ and $d$ respectively. Let the corresponding optimal solutions be $v_c^*$ and $v_d^*$. Since $v_d^*$ is a
feasible solution for $RLP(c)$, we have $v_d^* \ge v_c^*$. By the same token, $v_c^* \ge v_d^*$. Hence, $v_c^* = v_d^*$. □
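
The two conclusions of the lemma (independence from $c$ and dominance by every other feasible point) can be illustrated on a toy problem. The sketch below makes several assumptions that are not in the paper: a hypothetical 4-state, 2-action MDP with dynamics `f`, reward `reward` and partition map `PART`, and, instead of calling an LP solver, it computes the least feasible point of the RLP constraints by fixed-point iteration of $v(i) \leftarrow \max_{x \in S_i, u} \big[ R_u(x) + \lambda \sum_l p_l\, v(f(x,u,Y_l)) \big]$. Since that map is a monotone contraction, its fixed point lies below every feasible $v$, so it minimizes $c^{T} v$ for any non-negative $c$.

```python
# Hypothetical toy MDP (illustrative values only, not the perimeter patrol model).
LAM = 0.9
STATES = range(4)
ACTIONS = range(2)
P_DIST = [0.5, 0.5]        # disturbance probabilities p_l
PART = [0, 0, 1, 1]        # partition index of each state: S_1 = {0, 1}, S_2 = {2, 3}
M = 2                      # number of partitions


def f(x, u, l):
    """Assumed state transition map f(x, u, Y_l)."""
    return (x + u + l) % 4


def reward(x, u):
    """Assumed one-step reward R_u(x)."""
    return -float(x) - 0.5 * u


def value_iteration(iters=2000):
    """Optimal value function V* of the full toy MDP."""
    V = [0.0] * 4
    for _ in range(iters):
        V = [max(reward(x, u) + LAM * sum(p * V[f(x, u, l)] for l, p in enumerate(P_DIST))
                 for u in ACTIONS) for x in STATES]
    return V


def rlp_rhs(v, x, u):
    """Right-hand side of the RLP constraint for state x and input u."""
    return reward(x, u) + LAM * sum(p * v[PART[f(x, u, l)]] for l, p in enumerate(P_DIST))


def least_feasible(iters=2000):
    """Fixed point of v(i) = max over x in S_i and u of rlp_rhs(v, x, u)."""
    v = [0.0] * M
    for _ in range(iters):
        v = [max(rlp_rhs(v, x, u) for x in STATES if PART[x] == i for u in ACTIONS)
             for i in range(M)]
    return v


def is_feasible(v, tol=1e-9):
    """Check every RLP constraint v(i) >= rlp_rhs(v, x, u) for x in S_i."""
    return all(v[PART[x]] >= rlp_rhs(v, x, u) - tol for x in STATES for u in ACTIONS)


V_star = value_iteration()
v_bar = least_feasible()
assert is_feasible(v_bar)
assert all(v_bar[PART[x]] >= V_star[x] - 1e-9 for x in STATES)   # upper bound on V*
```

In this toy problem the constant vector $\max_{x,u} R_u(x)/(1-\lambda) = 0$ is also feasible, and `v_bar` sits below it, consistent with the dominance statement.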
EC.4. Proofs of the claims in Section 4.2



Claim 1. If $x_1 \ge x_2$, then for the same sequence of inputs $\mathbf{u}_t$ and disturbances $\mathbf{y}_t$, the system state
evolves in such a way that $x(t; x_1, \mathbf{u}_t, \mathbf{y}_t) \ge x(t; x_2, \mathbf{u}_t, \mathbf{y}_t)$ for every $t \ge 0$.

Proof of Claim 1. We use induction. Clearly, at $t = 0$, $x_1 \ge x_2$. By the semi-group property of
state transitions, it is sufficient to show that the result holds for $t = 1$. We define the state, $x$, of
the patrol system to be of two types. If the following holds:

\[
\ell_x \in \Omega, \quad d_x = 0, \quad A_{\ell_x, x} = 1, \quad \text{and } A_{j,x} = 1 \text{ for some } j \in \Omega,\; j \ne \ell_x, \tag{EC.10}
\]

i.e., the UAV is at a station with an alert, the dwell time is zero and there is also an alert at some
other station, then the state $x$ is of type 1. Otherwise, it is of type 2. Note that if $x_1 \ge x_2$, then the states
$x_1$ and $x_2$ are necessarily of the same type. The key property we will use in proving Claim 1 is
the following: the service delay at a station either remains at zero (if no new alert has occurred there),
goes up by 1 (if there is an unserviced alert there), or is reset to zero (if the UAV decides to
loiter there).

If $x_1$ and $x_2$ are of type 1 and the UAV chooses to loiter, i.e., $u(0) = 0$, we clearly see that neither
the location nor the dwell will differ at $t = 1$. Furthermore, the delays at $t = 1$ associated with the
stations corresponding to initial state $x_1$ will be no less than the delays associated with stations
corresponding to initial state $x_2$, since $x_1 \ge x_2$. If $z_1 = x(1; x_1, 0, y(0))$ and $z_2 = x(1; x_2, 0, y(0))$, we
see that $\ell_{z_1} = \ell_{z_2}$, $d_{z_1} = d_{z_2}$, $\omega_{z_1} = \omega_{z_2}$, and $\tau_{z_1,j} \ge \tau_{z_2,j}$, $\forall j \in \Omega$, for every disturbance $y(0)$, and so
$z_1 \ge z_2$. The same relationship holds for the other possible control choices, $u(0) \ne 0$, as well. By a
similar argument, one can show that $x(1; x_1, u(0), y(0)) \ge x(1; x_2, u(0), y(0))$ holds, regardless of the
control choice, even if the states are of type 2. We use the semi-group property as follows:
suppose the claim holds for all $t$ lying between $0$ and $l$ for some $l > 0$. Then, we treat the state
at $t = l$ as the initial condition for determining the evolution of the state at $t = l + 1$. The clock is
reset as $\bar{t} = t - l$, $t \ge l$. By the preceding arguments, Claim 1 holds for $\bar{t} = 1$, which is equivalent
to saying that it holds for $t = l + 1$. □
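
The one-step monotonicity used in the proof can be checked exhaustively on a stripped-down model. The following is a toy sketch only: it keeps just the vector of service delays and assumes a hypothetical update rule `step_delays` that mimics the three cases of the key property (stay at zero, increase by one, reset on loiter). The location, dwell time and direction components of the actual state are omitted, and the assumption that a new alert at an idle station raises its delay to 1 is ours.

```python
from itertools import product


def step_delays(delays, loiter_at, new_alert_at):
    """One step of a toy delay update (hypothetical simplification of the patrol model).

    delays: tuple of non-negative integer delays, one per station.
    loiter_at: index of the station being serviced, or None.
    new_alert_at: index of the station receiving a new alert, or None.
    """
    nxt = []
    for j, tau in enumerate(delays):
        if j == loiter_at:
            nxt.append(0)              # UAV loiters here: delay is reset to zero
        elif tau > 0 or j == new_alert_at:
            nxt.append(tau + 1)        # alert pending or just raised: delay grows by 1
        else:
            nxt.append(0)              # no alert: delay remains at zero
    return tuple(nxt)


def dominates(a, b):
    """Componentwise order a >= b on delay vectors."""
    return all(x >= y for x, y in zip(a, b))


def order_preserved(stations=2, max_delay=3):
    """Check that a >= b implies step(a) >= step(b) for every common input and disturbance."""
    grid = list(product(range(max_delay + 1), repeat=stations))
    choices = [None] + list(range(stations))
    for a, b in product(grid, grid):
        if not dominates(a, b):
            continue
        for loiter_at, new_alert_at in product(choices, choices):
            if not dominates(step_delays(a, loiter_at, new_alert_at),
                             step_delays(b, loiter_at, new_alert_at)):
                return False
    return True


assert order_preserved()
```

Because the check covers a single step from arbitrary ordered pairs, repeating it step by step mirrors the induction via the semi-group property.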
Claim 2. If $x_1 \ge x_2$, then $V^*(x_1) \le V^*(x_2)$. Furthermore, if $S_i \ge S_j$, then $\min_{x \in S_i} V^*(x) \le
\min_{z \in S_j} V^*(z)$.

Proof of Claim 2. Let $\pi^*$ be the optimal policy; accordingly, $\pi^*(x)$ is fixed for every $x \in S$. Then,
for every $t > 0$, we can determine $x(t; x_1, \mathbf{u}_t^*, \mathbf{y}_t)$ for some sequence of disturbances $\mathbf{y}_t$, where the
optimal input sequence $\mathbf{u}_t^* = \{u^*(0), \ldots, u^*(t-1)\}$ (starting with $x_1$) can be recursively obtained
as follows:

\[
u^*(t) = \pi^*\big(x(t-1; x_1, \mathbf{u}_{t-1}^*, \mathbf{y}_{t-1})\big), \tag{EC.11}
\]

with the initialization $u^*(0) = \pi^*(x_1)$. For the above $u^*$ and $y$, we can then determine the evolution
of the states corresponding to initial state $x_2$. Since $x(t; x_1, \mathbf{u}_t^*, \mathbf{y}_t) \ge x(t; x_2, \mathbf{u}_t^*, \mathbf{y}_t)$ by Claim 1, we
notice readily that the reward satisfies $R_{u^*(t)}\big(x(t; x_1, \mathbf{u}_t^*, \mathbf{y}_t)\big) \le R_{u^*(t)}\big(x(t; x_2, \mathbf{u}_t^*, \mathbf{y}_t)\big)$ for every $t \ge 0$ (the
inequality follows since the one-step reward is based only on the maximum delay, dwell time and control input).
Since the above holds for any given disturbance sequence, the expected discounted payoff
associated with the state starting from $x_1$, i.e., $V^*(x_1)$, is no more than the expected discounted
payoff associated with the state starting from $x_2$, which we will denote by $V_{u^*}(x_2)$. As a result,
$V^*(x_1) \le V_{u^*}(x_2) \le V^*(x_2)$. The second inequality holds since $u^*$ as defined in (EC.11)
is a sub-optimal control policy for the state evolution starting from $x_2$, and hence the expected
discounted payoff associated with that policy is necessarily dominated by the optimal value function
starting from $x_2$. To complete the proof, consider two different partitions $S_i$ and $S_j$ such that
$S_i \ge S_j$. Let $\bar{z} = \arg\min_{z \in S_j} V^*(z)$; this can always be found since we are dealing with a subset
$S_j$ of a finite state space $S$. Since $S_i \ge S_j$, $\exists x \in S_i$ such that $x \ge \bar{z}$. We have shown that for this
case, $V^*(x) \le V^*(\bar{z}) = \min_{z \in S_j} V^*(z) \implies \min_{x \in S_i} V^*(x) \le \min_{z \in S_j} V^*(z)$. □